<a href="https://colab.research.google.com/github/xc308/Large_Language_Model/blob/main/4_Pre_training_LLMs_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pretraining large language models (LLMs) using the popular Hugging Face library

This notebook shows how to:

- load pre-trained models from Hugging Face
- make inferences using the Pipeline module
- further train pre-trained LLMs on your own data (self-supervised fine-tuning)

In [1]:
!pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 torch==2.1.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118


In [2]:
!pip install pmdarima -U
!pip install --upgrade pmdarima==2.0.2

Collecting pmdarima
  Using cached pmdarima-2.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Using cached pmdarima-2.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.2 MB)
Installing collected packages: pmdarima
  Attempting uninstall: pmdarima
    Found existing installation: pmdarima 2.0.2
    Uninstalling pmdarima-2.0.2:
      Successfully uninstalled pmdarima-2.0.2
Successfully installed pmdarima-2.0.4
Collecting pmdarima==2.0.2
  Using cached pmdarima-2.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Using cached pmdarima-2.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (1.9 MB)
Installing collected packages: pmdarima
  Attempting uninstall: pmdarima
    Found existing installation: pmdarima 2.0.4
    Uninstalling pmdarima-2.0.4:
      Successfully uninstalled pmdarima-2.0.4
Successfully installe

In [3]:
!pip install -U git+https://github.com/huggingface/transformers
!pip install --user datasets # 2.15.0
!pip install --user "portalocker>=2.0.0"
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U accelerate
!pip install --user torch==2.6.0
!pip install -U torchvision
!pip install --user protobuf==3.20.*

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-tae_rop2
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-tae_rop2
  Resolved https://github.com/huggingface/transformers to commit 4e63a1747ce6a4b5f75e8d2318857c2b76c3ba23
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tokenizers<0.22,>=0.21 (from transformers==4.52.0.dev0)
  Using cached tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Using cached tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers

In [4]:
!pip install --user datasets



In [5]:
from transformers import pipeline
from datasets import load_dataset

from tqdm.auto import tqdm
import math
import time
import os


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [6]:
# Set the environment variable TOKENIZERS_PARALLELISM to 'false'
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

## Pretraining and self-supervised fine-tuning

- Pretrained models can be further tuned by training them on domain-specific unlabeled data, which is known as self-supervised fine-tuning.

- The model can then be fine-tuned on specific downstream tasks using labeled data, a process known as supervised fine-tuning, which further improves its performance on those tasks.



## Loading a pretrained model from Hugging Face and making an inference:

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer


model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("This movie was really")[0]["generated_text"])

Device set to use cpu


This movie was really good. I was really surprised by how good it was.
I was


## Pre-training Objectives

Three commonly used pre-training objectives are:
- Masked language modeling (MLM): some words in a sentence are randomly masked, and the model is trained to predict the masked words from the context provided by the surrounding words.
- Next sentence prediction (NSP): the model is trained to predict whether two sentences are consecutive in the original text or randomly chosen from the corpus. This helps it learn sentence-level relationships and the coherence between sentences.
- Next token prediction: the model is trained to predict the next token in a sequence of text.
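
To make the objectives more concrete, here is a minimal sketch of how training examples for next token prediction and NSP could be built from raw text. The helper names (`next_token_examples`, `nsp_examples`) and the toy sentences are illustrative assumptions, not part of the Hugging Face API; masking for MLM is handled later by the data collator.

```python
import random

def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def nsp_examples(sentences, rng):
    """Return (sentence_a, sentence_b, is_next) triples for next sentence prediction.

    Each adjacent pair gives one positive example. A negative example pairs the
    same first sentence with a randomly drawn sentence that is not its successor.
    """
    examples = []
    for i in range(len(sentences) - 1):
        examples.append((sentences[i], sentences[i + 1], True))
        candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
        if candidates:
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples

tokens = ["this", "movie", "was", "really", "good"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)

sentences = ["The team lost.", "The coach resigned.", "Rain is expected.", "Bring an umbrella."]
for a, b, is_next in nsp_examples(sentences, random.Random(0)):
    print(is_next, "|", a, "|", b)
```

In practice the causal LM loss is computed for all positions at once by shifting the labels one step, rather than by materializing every prefix as done here for readability.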

## Self-supervised training of a BERT model

This section uses the Hugging Face Transformers library, which provides pre-implemented BERT models and tools for pre-training. The steps are:

- Prepare the train dataset
- Train a Tokenizer
- Preprocess the dataset
- Pre-train BERT using an MLM task
- Evaluate the trained model

## Importing required datasets

In [8]:
# Load the datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

In [9]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


In [10]:
#check a sample record
dataset["train"][400]

{'text': " When Mason was injured in warm @-@ ups late in the year , Columbus was without an active goaltender on their roster . To remedy the situation , the team signed former University of Michigan goaltender Shawn Hunwick to a one @-@ day , amateur tryout contract . After being eliminated from the NCAA Tournament just days prior , Hunwick skipped an astronomy class and drove his worn down 2003 Ford Ranger to Columbus to make the game . He served as the back @-@ up to Allen York during the game , and the following day , he signed a contract for the remainder of the year . With Mason returning from injury , Hunwick was third on the team 's depth chart when an injury to York allowed Hunwick to remain as the back @-@ up for the final two games of the year . In the final game of the season , the Blue Jackets were leading the Islanders 7 – 3 with 2 : 33 remaining when , at the behest of his teammates , Head Coach Todd Richards put Hunwick in to finish the game . He did not face a shot . 

In [11]:
## downsize the training dataset, to run on cpu


In [12]:
dataset["train"] = dataset["train"].select(range(1000))
dataset["test"] = dataset["test"].select(range(200))

In [13]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 200
    })
    train: Dataset({
        features: ['text'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


### Writing the training and test splits to text files:

In [14]:
# Path to save the datasets to text files
output_file_train = "wikitext_dataset_train.txt"
output_file_test = "wikitext_dataset_test.txt"

# Open the output file in write mode
with open(output_file_train, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["train"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

# Open the output file in write mode
with open(output_file_test, "w", encoding="utf-8") as f:
    # Iterate over each example in the dataset
    for example in dataset["test"]:
        # Write the example text to the file
        f.write(example["text"] + "\n")

## Define a tokenizer to be used for tokenizing the dataset.

In [15]:
# create a tokenizer from existing one to re-use special tokens
from transformers import BertTokenizerFast

bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

In [16]:
model_name = 'bert-base-uncased'

# BERT is an encoder by default. To load it with a causal LM head as a standalone
# model, `is_decoder=True` must be set on the model configuration, not on the tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_name, is_decoder=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [17]:
len(dataset)

3

## Training a Tokenizer

- Here the tokenizer is trained on our own dataset.

In [18]:
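## create a python generator to dynamically load the data
def batch_iterator(batch_size=100):
    # NOTE: `dataset` is a DatasetDict, so len(dataset) is the number of splits
    # (3, as shown above), not the number of training rows. As written, this loop
    # yields a single batch (rows 0-99 of the training split). To iterate over the
    # whole training split, use len(dataset["train"]) here instead.
    for i in tqdm(range(0, len(dataset), batch_size)):
        yield dataset['train'][i : i + batch_size]["text"]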

## create a tokenizer from existing one to re-use special tokens
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

## train the tokenizer using our own dataset
bert_tokenizer = bert_tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=30522)


  0%|          | 0/1 [00:00<?, ?it/s]

## Pretraining

Define the BERT configuration and create the model. The main configuration parameters are:

- vocab_size: Specifies the size of the vocabulary. This number should match the vocabulary size used by the tokenizer. The code below reads it from the trained tokenizer with `len(bert_tokenizer.get_vocab())`; 30522 is the size requested when the tokenizer was trained.
- hidden_size=768: Sets the size of the hidden layers.
- num_hidden_layers=12: Determines the number of hidden layers in the transformer model.
- num_attention_heads=12: Sets the number of attention heads in each attention layer.
- intermediate_size=3072: Specifies the size of the "intermediate" (i.e., feed-forward) layer within the transformer.
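
As a rough worked example of what these numbers imply, the sketch below counts the parameters of the BERT encoder (embeddings, transformer layers, and optional pooler) from the configuration values. The function name `bert_param_count` is hypothetical, the position and token-type sizes (512 and 2) are taken from the model printout shown later, and the MLM head is deliberately left out, so the figure for `BertForMaskedLM` will differ somewhat. Note that `num_attention_heads` does not appear: the heads split the hidden size, so they do not change the parameter count.

```python
def bert_param_count(vocab_size, hidden_size=768, num_hidden_layers=12,
                     intermediate_size=3072, max_position_embeddings=512,
                     type_vocab_size=2, with_pooler=True):
    """Count weights and biases of a BERT encoder from its configuration values."""
    layer_norm = 2 * hidden_size  # scale + bias
    # Word, position and token-type embedding tables, followed by a LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value and output projections (weight + bias each), then a LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then a LayerNorm
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)
    encoder = num_hidden_layers * (attention + feed_forward)
    pooler = hidden_size * hidden_size + hidden_size if with_pooler else 0
    return embeddings + encoder + pooler

print(bert_param_count(30522))                     # 109482240
print(bert_param_count(3569, with_pooler=False))   # smaller vocabulary, no pooler
```

Most of the parameters sit in the 12 transformer layers; the vocabulary size only changes the word-embedding table, at 768 parameters per token.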

In [19]:
# Define the BERT configuration

from transformers import BertConfig


config = BertConfig(
    vocab_size=len(bert_tokenizer.get_vocab()),  # Specify the vocabulary size(Make sure this number equals the vocab_size of the tokenizer)
    hidden_size=768,  # Set the hidden size
    num_hidden_layers=12,  # Set the number of layers
    num_attention_heads=12,  # Set the number of attention heads
    intermediate_size=3072,  # Set the intermediate size
)

In [20]:
# Create the BERT model for pre-training

from transformers import BertForMaskedLM
model = BertForMaskedLM(config)

In [21]:
# check model configuration
model

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(3569, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

## Tokenize Dataset Dynamically

- The dataset is tokenized dynamically.
- This approach provides greater flexibility and integrates well with modern NLP workflows.

## Tokenization Function

The tokenization function preprocesses the text data by tokenizing and formatting it for model training.

In [22]:
# Tokenize dataset dynamically
def tokenize_function(examples):
    return bert_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)


# Tokenize train and test datasets
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])


# Print tokenized dataset sample
print(tokenized_datasets["train"][0])


Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/3760 [00:00<?, ? examples/s]

{'input_ids': [2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
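
The long run of zeros in the output above comes from `padding="max_length"`: every record is extended to `max_length` tokens. A minimal sketch of that behavior follows, assuming a pad id of 0 (consistent with `padding_idx=0` in the model printout). The helper `pad_or_truncate` is hypothetical and simplified: the real tokenizer also keeps the closing special token when it truncates and returns additional fields.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # simplified truncation: cut off the tail
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    # Pad on the right; the attention mask is 0 on padding positions
    return ids + [pad_id] * padding, mask + [0] * padding

ids, mask = pad_or_truncate([2, 3], max_length=8)
print(ids)   # [2, 3, 0, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 0, 0, 0, 0, 0, 0]
```

Padding every record to 512 tokens is simple but wasteful for short texts; padding each batch to its longest member is a common alternative.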

In [23]:
# Split into training and test sets
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]

In [24]:
train_dataset[0]

{'input_ids': [2,
  3,
  0,
  0,
  0,
  ...

## Prepare data for the masked LM task by masking random tokens

## Define the Data Collator for Language Modeling

- This step uses `DataCollatorForLanguageModeling` from the Hugging Face Transformers library.

- A data collator is used during training to dynamically create batches of data.

- For language modeling, particularly for models like BERT that use masked language modeling (MLM), this collator prepares training batches by automatically masking tokens according to a specified probability.
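
The masking step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the collator's actual implementation: the function name `mask_for_mlm`, the toy token ids, and the special-token ids (0 for padding, 2 and 3 for the sentence markers, 4 for `[MASK]`) are assumptions, and the 80/10/10 split follows the rule described in the BERT paper. Labels of -100 mark positions that the loss ignores.

```python
import random

def mask_for_mlm(input_ids, mask_id, vocab_size, special_ids, rng, mlm_probability=0.15):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    inputs = list(input_ids)
    labels = [-100] * len(input_ids)      # -100 = ignored by the loss
    for pos, token in enumerate(input_ids):
        # Never mask special tokens; select other tokens with probability mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[pos] = token               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
inputs, labels = mask_for_mlm([2, 10, 11, 12, 13, 3, 0, 0], mask_id=4,
                              vocab_size=100, special_ids={0, 2, 3}, rng=rng)
print(inputs)
print(labels)
```

Because the selection is redone every time a batch is built, the model sees different masked positions for the same sentence across epochs.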


In [25]:
from transformers import DataCollatorForLanguageModeling

# Prepare the data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer,  #
    mlm=True,                  # mask language model
    mlm_probability=0.15       # mask prob = 0.15
)

In [26]:
# check how collator transforms a sample input data record
data_collator([train_dataset[0]])

{'input_ids': tensor([[2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0,

In [27]:
!pip uninstall transformers -y
!pip install transformers==4.40.1

Found existing installation: transformers 4.52.0.dev0
Uninstalling transformers-4.52.0.dev0:
  Successfully uninstalled transformers-4.52.0.dev0
Collecting transformers==4.40.1
  Using cached transformers-4.40.1-py3-none-any.whl.metadata (137 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.40.1)
  Using cached tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Using cached transformers-4.40.1-py3-none-any.whl (9.0 MB)
Using cached tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.21.1
    Uninstalling tokenizers-0.21.1:
      Successfully uninstalled tokenizers-0.21.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transfo

In [35]:
# Define the training arguments
#!pip install transformers==4.28.1
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./trained_model",  # Specify the output directory for the trained model
    overwrite_output_dir=True,     # overwrite the contents of the output directory if it already exists, useful when running experiments multiple times
    do_eval=True,                  # model will be evaluated at the specified intervals
    #evaluation_strategy="epoch",   # the model will be evaluated at the end of each epoch.
    learning_rate=5e-5,            # default
    num_train_epochs=10,           # Specify the number of training epochs
    per_device_train_batch_size=2, # Set the batch size for training
    save_total_limit=2,            # Only the most recent two checkpoints will be kept.
    logging_steps=20,              # how often (in steps) to log training information, which helps monitor the training process
    optim="adamw_torch"
)


In [36]:
# Instantiate the Trainer
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
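
The cells shown here construct the `Trainer` but do not include the call that runs training. As a minimal sketch of what typically follows, `trainer.train()` runs the training loop and `trainer.evaluate()` returns a dictionary of metrics that includes `eval_loss`. Perplexity, a common summary of language-model quality, is the exponential of that mean cross-entropy loss. The helper name `perplexity_from_loss` is hypothetical, and the `Trainer` calls are left commented out because training a 12-layer BERT on CPU is slow.

```python
import math

def perplexity_from_loss(eval_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(eval_loss)

# Typical usage with the Trainer defined above:
# trainer.train()
# eval_results = trainer.evaluate()
# print(f"Perplexity: {perplexity_from_loss(eval_results['eval_loss']):.2f}")

print(perplexity_from_loss(0.0))                      # 1.0
print(round(perplexity_from_loss(math.log(50)), 6))   # 50.0
```

A perplexity of 50 means the model is, on average, about as uncertain as if it were choosing uniformly among 50 tokens; lower is better.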

In [39]:
import torch

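# Download a saved checkpoint for the BERT model defined above
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BeXRxFT2EyQAmBHvxVaMYQ/bert-scratch-model.pt'
# Resize the embedding matrix to 30522 tokens so that its shape matches the checkpoint
model.resize_token_embeddings(30522)
# Load the checkpoint weights onto the CPU
model.load_state_dict(torch.load('bert-scratch-model.pt', map_location=torch.device('cpu')))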

--2025-04-14 20:50:34--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/BeXRxFT2EyQAmBHvxVaMYQ/bert-scratch-model.pt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 438141816 (418M) [binary/octet-stream]
Saving to: ‘bert-scratch-model.pt.1’


2025-04-14 20:50:39 (84.6 MB/s) - ‘bert-scratch-model.pt.1’ saved [438141816/438141816]



<All keys matched successfully>

In [40]:
# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create a pipeline for the "fill-mask" task
mask_filler = pipeline("fill-mask", model=model, tokenizer=bert_tokenizer)

# Generate predictions by filling the mask in the input text
results = mask_filler(text)  # the top_k parameter sets how many predictions are returned

# Print the predicted sequences
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

Device set to use cpu


Predicted token: 3, Confidence: 0.10
Predicted token: 1, Confidence: 0.07
Predicted token: one, Confidence: 0.06
Predicted token: l, Confidence: 0.05
Predicted token: dun, Confidence: 0.04


The weak performance may be due to insufficient training, a lack of training data, the model architecture, or untuned hyperparameters.

## Try a pretrained model from Hugging Face: inference with a pretrained BERT model

In [41]:
# Load the pretrained BERT model and tokenizer
pretrained_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
pretrained_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [42]:
# Define the input text with a masked token
text = "This is a [MASK] movie!"

# Create the pipeline
mask_filler = pipeline(task='fill-mask', model=pretrained_model, tokenizer=pretrained_tokenizer)

# Perform inference using the pipeline
results = mask_filler(text)
for result in results:
    print(f"Predicted token: {result['token_str']}, Confidence: {result['score']:.2f}")

Device set to use cpu


Predicted token: great, Confidence: 0.16
Predicted token: horror, Confidence: 0.08
Predicted token: good, Confidence: 0.08
Predicted token: bad, Confidence: 0.05
Predicted token: fantastic, Confidence: 0.04


- These predictions are much better than those of the previous model, which was trained on only a small sample of the data.

- However, a pretrained model cannot be used directly for specific downstream tasks such as sentiment extraction or sequence classification. This is why supervised fine-tuning methods are introduced.