Overview

The ALBERT model was proposed in ALBERT: A Lite BERT for Self-supervised Learning of Language Representations by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. It presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT:

Splitting the embedding matrix into two smaller matrices.
Using repeating layers split among groups.

The abstract from the paper is the following:

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

This model was contributed by lysandre. The JAX version of this model was contributed by kamalkraj. The original code can be found here.
Usage tips

ALBERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains similar to that of a BERT-like architecture with the same number of hidden layers, because it has to iterate through the same number of (repeating) layers.
The embedding size E differs from the hidden size H. This is justified because the embeddings are context independent (one embedding vector represents one token), whereas hidden states are context dependent (one hidden state represents a sequence of tokens), so it’s more logical to have H >> E. Also, the embedding matrix is large since it’s V x E (V being the vocab size). If E < H, it has fewer parameters.
Layers are split in groups that share parameters (to save memory).
Next sentence prediction is replaced by a sentence ordering prediction: in the inputs, we have two sentences A and B (that are consecutive) and we either feed A followed by B or B followed by A. The model must predict if they have been swapped or not.
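
The parameter saving from splitting the embedding matrix can be checked with simple arithmetic. The sketch below is an illustration, not library code: it compares a single V x H embedding matrix with a V x E matrix followed by an E x H projection, ignoring biases. The values V = 30000, E = 128 and H = 768 are taken from the configuration defaults and the base model printout elsewhere on this page; the function name `embedding_params` is made up for this example.

```python
def embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> dict:
    """Count embedding parameters with and without factorization (biases ignored)."""
    # One large V x H matrix, as in a BERT-style embedding.
    direct = vocab_size * hidden_size
    # A V x E lookup table followed by an E x H projection.
    factorized = vocab_size * embedding_size + embedding_size * hidden_size
    return {"direct": direct, "factorized": factorized}


counts = embedding_params(vocab_size=30000, embedding_size=128, hidden_size=768)
print(counts)
print(f"reduction factor: {counts['direct'] / counts['factorized']:.2f}")
```

With these sizes the direct matrix has 23,040,000 entries and the factorized pair has 3,938,304, which is why E < H reduces the parameter count.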

Resources

The resources provided in the following sections consist of a list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ALBERT. If you’re interested in submitting a resource to be included here, please feel free to open a Pull Request and we’ll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
Text Classification

AlbertForSequenceClassification is supported by this example script.

TFAlbertForSequenceClassification is supported by this example script.

FlaxAlbertForSequenceClassification is supported by this example script and notebook.

Check the Text classification task guide on how to use the model.

Token Classification

AlbertForTokenClassification is supported by this example script.

TFAlbertForTokenClassification is supported by this example script and notebook.

FlaxAlbertForTokenClassification is supported by this example script.
Token classification chapter of the 🤗 Hugging Face Course.
Check the Token classification task guide on how to use the model.

Fill-Mask

AlbertForMaskedLM is supported by this example script and notebook.
TFAlbertForMaskedLM is supported by this example script and notebook.
FlaxAlbertForMaskedLM is supported by this example script and notebook.
Masked language modeling chapter of the 🤗 Hugging Face Course.
Check the Masked language modeling task guide on how to use the model.

Question Answering

AlbertForQuestionAnswering is supported by this example script and notebook.
TFAlbertForQuestionAnswering is supported by this example script and notebook.
FlaxAlbertForQuestionAnswering is supported by this example script.
Question answering chapter of the 🤗 Hugging Face Course.
Check the Question answering task guide on how to use the model.

Multiple choice

AlbertForMultipleChoice is supported by this example script and notebook.

TFAlbertForMultipleChoice is supported by this example script and notebook.

Check the Multiple choice task guide on how to use the model.

AlbertConfig
class transformers.AlbertConfig
( vocab_size = 30000, embedding_size = 128, hidden_size = 4096, num_hidden_layers = 12, num_hidden_groups = 1, num_attention_heads = 64, intermediate_size = 16384, inner_group_num = 1, hidden_act = 'gelu_new', hidden_dropout_prob = 0, attention_probs_dropout_prob = 0, max_position_embeddings = 512, type_vocab_size = 2, initializer_range = 0.02, layer_norm_eps = 1e-12, classifier_dropout_prob = 0.1, position_embedding_type = 'absolute', pad_token_id = 0, bos_token_id = 2, eos_token_id = 3, **kwargs )

Parameters

vocab_size (int, optional, defaults to 30000) — Vocabulary size of the ALBERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling AlbertModel or TFAlbertModel.
embedding_size (int, optional, defaults to 128) — Dimensionality of vocabulary embeddings.
hidden_size (int, optional, defaults to 4096) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_hidden_groups (int, optional, defaults to 1) — Number of groups for the hidden layers. Parameters in the same group are shared.
num_attention_heads (int, optional, defaults to 64) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 16384) — The dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
inner_group_num (int, optional, defaults to 1) — The number of inner repetitions of the attention and feed-forward (FFN) blocks.
hidden_act (str or Callable, optional, defaults to "gelu_new") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0) — The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large (e.g., 512 or 1024 or 2048).
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling AlbertModel or TFAlbertModel.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
classifier_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for attached classifiers.
position_embedding_type (str, optional, defaults to "absolute") — Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
pad_token_id (int, optional, defaults to 0) — Padding token id.
bos_token_id (int, optional, defaults to 2) — Beginning of stream token id.

eos_token_id (int, optional, defaults to 3) — End of stream token id.
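
To make the relationship between num_hidden_layers and num_hidden_groups concrete, here is a minimal sketch of how layer indices could be assigned to shared parameter groups. It assumes each layer index i is mapped to group floor(i / (num_hidden_layers / num_hidden_groups)); this mapping and the helper name `layer_to_group` are assumptions for illustration, and the library's actual code may differ.

```python
from typing import List


def layer_to_group(num_hidden_layers: int, num_hidden_groups: int) -> List[int]:
    """Return, for each layer index, the index of the parameter group it would reuse."""
    # Assumed mapping: consecutive layers are bucketed evenly into groups.
    layers_per_group = num_hidden_layers / num_hidden_groups
    return [int(i / layers_per_group) for i in range(num_hidden_layers)]


# Default configuration: 12 layers, 1 group, so every layer reuses the same weights.
print(layer_to_group(12, 1))
# Hypothetical setting with 4 groups: every 3 consecutive layers share one set of weights.
print(layer_to_group(12, 4))
```

Under this assumption the default configuration stores one set of layer weights and applies it 12 times, which matches the usage tip that memory shrinks while compute stays similar.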

This is the configuration class to store the configuration of an AlbertModel or a TFAlbertModel. It is used to instantiate an ALBERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ALBERT albert/albert-xxlarge-v2 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Examples:

In [1]:
from transformers import AlbertConfig, AlbertModel

# Initializing an ALBERT-xxlarge style configuration
albert_xxlarge_configuration = AlbertConfig()

# Initializing an ALBERT-base style configuration
albert_base_configuration = AlbertConfig(
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)

# Initializing a model (with random weights) from the ALBERT-xxlarge style configuration
model = AlbertModel(albert_xxlarge_configuration)

# Accessing the model configuration
configuration = model.config

In [2]:
configuration

AlbertConfig {
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 16384,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "num_attention_heads": 64,
  "num_hidden_groups": 1,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.38.2",
  "type_vocab_size": 2,
  "vocab_size": 30000
}

PyTorch

In [3]:
from transformers import AutoTokenizer, AlbertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AlbertModel.from_pretrained("albert/albert-base-v2")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

In [5]:
inputs

{'input_ids': tensor([[    2, 10975,    15,    51,  1952,    25, 10901,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [6]:
outputs

BaseModelOutputWithPooling(last_hidden_state=tensor([[[ 1.3997,  1.5700,  0.3336,  ..., -0.0686,  0.2804,  0.8287],
         [ 0.3306,  0.3647,  0.7145,  ..., -0.5266,  1.2512, -0.7154],
         [ 1.1538,  0.6781, -1.6579,  ...,  0.6821,  0.3878,  0.4889],
         ...,
         [ 1.5001, -0.4411,  1.2422,  ...,  1.3102,  0.0211, -1.0564],
         [ 0.4044, -0.0901,  1.0914,  ...,  0.4799,  0.6582, -1.0785],
         [ 0.0455,  0.1439, -0.0616,  ..., -0.0906,  0.1141,  0.2033]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.4210, -0.5434,  0.7271, -0.9191,  0.6473, -0.9005,  0.5429, -0.4988,
          0.5720, -0.9995,  0.9408,  0.4471, -0.1917, -0.9711, -0.9631, -0.4996,
          0.4965,  0.5118,  0.9865, -0.5392, -0.8583, -0.9906,  0.9915,  0.9829,
          0.7456, -0.5302,  0.6063, -0.9586, -0.9997, -0.5337, -1.0000,  0.5163,
          0.5575,  0.5369,  0.5593, -0.4074,  0.5447,  0.9906, -0.5874,  0.5187,
          0.5323, -0.9912, -0.8540,  0.4816,  0.50

In [4]:
last_hidden_states

tensor([[[ 1.3997,  1.5700,  0.3336,  ..., -0.0686,  0.2804,  0.8287],
         [ 0.3306,  0.3647,  0.7145,  ..., -0.5266,  1.2512, -0.7154],
         [ 1.1538,  0.6781, -1.6579,  ...,  0.6821,  0.3878,  0.4889],
         ...,
         [ 1.5001, -0.4411,  1.2422,  ...,  1.3102,  0.0211, -1.0564],
         [ 0.4044, -0.0901,  1.0914,  ...,  0.4799,  0.6582, -1.0785],
         [ 0.0455,  0.1439, -0.0616,  ..., -0.0906,  0.1141,  0.2033]]],
       grad_fn=<NativeLayerNormBackward0>)

In [7]:
from transformers import AutoTokenizer, AlbertForPreTraining
import torch

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AlbertForPreTraining.from_pretrained("albert/albert-base-v2")

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)
# Batch size 1
outputs = model(input_ids)

prediction_logits = outputs.prediction_logits
sop_logits = outputs.sop_logits

Some weights of AlbertForPreTraining were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['sop_classifier.classifier.bias', 'sop_classifier.classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
model

AlbertForPreTraining(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)

In [9]:
input_ids

tensor([[    2, 10975,    15,    51,  1952,    25, 10901,     3]])

In [10]:
outputs

AlbertForPreTrainingOutput(loss=None, prediction_logits=tensor([[[  4.0208,  -2.3570,  -6.3041,  ...,  -3.8864, -11.4585,  -4.5589],
         [ -1.2179,  -0.5231,   0.7897,  ...,  -2.1118,  -5.1421,  -1.6695],
         [  3.7011,   2.5179,  -0.9651,  ...,  -5.2153,  -6.8370,  -3.5772],
         ...,
         [ -0.4021,   3.1921,   7.3883,  ...,  -5.6390,  -4.4805,  -0.2845],
         [ -1.6858,  -0.1080,   1.8938,  ...,  -0.5007,  -7.3870,   3.1808],
         [  0.5453,   3.3645,  -6.1220,  ...,  -4.8960,  -3.4328,  -3.6371]]],
       grad_fn=<ViewBackward0>), sop_logits=tensor([[-0.9805, -0.1571]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [11]:
prediction_logits

tensor([[[  4.0208,  -2.3570,  -6.3041,  ...,  -3.8864, -11.4585,  -4.5589],
         [ -1.2179,  -0.5231,   0.7897,  ...,  -2.1118,  -5.1421,  -1.6695],
         [  3.7011,   2.5179,  -0.9651,  ...,  -5.2153,  -6.8370,  -3.5772],
         ...,
         [ -0.4021,   3.1921,   7.3883,  ...,  -5.6390,  -4.4805,  -0.2845],
         [ -1.6858,  -0.1080,   1.8938,  ...,  -0.5007,  -7.3870,   3.1808],
         [  0.5453,   3.3645,  -6.1220,  ...,  -4.8960,  -3.4328,  -3.6371]]],
       grad_fn=<ViewBackward0>)

In [12]:
sop_logits

tensor([[-0.9805, -0.1571]], grad_fn=<AddmmBackward0>)
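
The two SOP logits can be turned into probabilities with a softmax. The sketch below uses only the standard library and the values printed above. Note that the loading message above reports the `sop_classifier` weights as newly initialized, so these particular numbers carry no meaning; the example only shows the mechanics. Which index corresponds to "swapped" is not stated on this page, so the sketch does not assume it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


sop_logits = [-0.9805, -0.1571]  # values copied from the output above
probs = softmax(sop_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```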

AlbertForMaskedLM

Albert Model with a language modeling head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

In [13]:
import torch
from transformers import AutoTokenizer, AlbertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert/albert-base-v2")

# add mask_token
inputs = tokenizer("The capital of [MASK] is Paris.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# retrieve index of [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
tokenizer.decode(predicted_token_id)

Some weights of the model checkpoint at albert/albert-base-v2 were not used when initializing AlbertForMaskedLM: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


'france'

In [14]:
mask_token_index

tensor([4])

In [15]:
predicted_token_id

tensor([714])

In [16]:
tokenizer.decode(predicted_token_id)

'france'

In [17]:
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)
outputs = model(**inputs, labels=labels)
round(outputs.loss.item(), 2)

0.81
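
The `torch.where(...)` call above keeps the true token id only at masked positions and sets every other label to -100, which the loss ignores. A plain-Python sketch of the same labeling rule follows. The token ids are made up for illustration (only 2, 3 and 714 appear in the outputs on this page), and the mask id of 4 is a hypothetical placeholder rather than a value read from the tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by the loss, as used in the cell above


def build_mlm_labels(masked_ids, original_ids, mask_token_id):
    """Keep the original id where the input was masked; ignore all other positions."""
    if len(masked_ids) != len(original_ids):
        raise ValueError("masked and original sequences must have the same length")
    return [
        orig if tok == mask_token_id else IGNORE_INDEX
        for tok, orig in zip(masked_ids, original_ids)
    ]


MASK_ID = 4  # hypothetical mask token id, for illustration only
masked_ids = [2, 11, 12, 13, MASK_ID, 14, 15, 16, 3]  # made-up ids, mask at position 4
original_ids = [2, 11, 12, 13, 714, 14, 15, 16, 3]    # same sequence with the true token
labels = build_mlm_labels(masked_ids, original_ids, MASK_ID)
print(labels)
```

Only position 4 contributes to the loss here, mirroring `mask_token_index` in the cells above.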

AlbertForSequenceClassification

Parameters

config (AlbertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Albert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

Example of single-label classification:

In [18]:
import torch
from transformers import AutoTokenizer, AlbertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-imdb")
model = AlbertForSequenceClassification.from_pretrained("textattack/albert-base-v2-imdb")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

# To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
num_labels = len(model.config.id2label)
model = AlbertForSequenceClassification.from_pretrained("textattack/albert-base-v2-imdb", num_labels=num_labels)

labels = torch.tensor([1])
loss = model(**inputs, labels=labels).loss
round(loss.item(), 2)

0.12

In [19]:
model

AlbertForSequenceClassification(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=768, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768,

In [20]:
inputs

{'input_ids': tensor([[    2, 10975,    15,    51,  1952,    25, 10901,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [21]:
predicted_class_id

1

In [22]:
num_labels

2

In [26]:
labels[0]

tensor(1)

In [27]:
model.config.id2label[predicted_class_id]

'LABEL_1'

Example of multi-label classification:

In [28]:
import torch
from transformers import AutoTokenizer, AlbertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("textattack/albert-base-v2-imdb")
model = AlbertForSequenceClassification.from_pretrained("textattack/albert-base-v2-imdb", problem_type="multi_label_classification")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]

# To train a model on `num_labels` classes, you can pass `num_labels=num_labels` to `.from_pretrained(...)`
num_labels = len(model.config.id2label)
model = AlbertForSequenceClassification.from_pretrained(
    "textattack/albert-base-v2-imdb", num_labels=num_labels, problem_type="multi_label_classification"
)

labels = torch.sum(
    torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
).to(torch.float)
loss = model(**inputs, labels=labels).loss

In [41]:
predicted_class_ids[0].item()


1
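
The two examples above differ in how logits become predictions: single-label classification takes the argmax, while multi-label classification applies a sigmoid to each logit independently and keeps every class above a threshold. A small standard-library sketch of both decision rules is shown below; the logit values and helper names are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_single_label(logits):
    """Single-label: exactly one class, the one with the highest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def predict_multi_label(logits, threshold=0.5):
    """Multi-label: every class whose independent sigmoid probability exceeds the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) > threshold]


logits_a = [-1.2, 2.3]  # made-up logits
logits_b = [0.4, 1.1]   # made-up logits where both classes are active

print(predict_single_label(logits_a), predict_multi_label(logits_a))
print(predict_single_label(logits_b), predict_multi_label(logits_b))
```

With a threshold of 0.5, the multi-label rule is equivalent to keeping every class whose logit is positive, so it can return zero, one, or several classes.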

AlbertForMultipleChoice

In [42]:
from transformers import AutoTokenizer, AlbertForMultipleChoice
import torch

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AlbertForMultipleChoice.from_pretrained("albert/albert-base-v2")

prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
choice0 = "It is eaten with a fork and a knife."
choice1 = "It is eaten while held in the hand."
labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1

encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1

# the linear classifier still needs to be trained
loss = outputs.loss
logits = outputs.logits

Some weights of AlbertForMultipleChoice were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
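
The `unsqueeze(0)` in the cell above gives the inputs the shape (batch size, number of choices, sequence length). A common way to score such inputs, sketched below with nested lists, is to flatten the choices into the batch dimension, score every sequence, and then regroup the scores per example. This is an illustrative sketch under that assumption; `toy_score` is a hypothetical stand-in for the encoder plus classifier, and the token ids are made up.

```python
def flatten_choices(batch):
    """(batch, num_choices, seq_len) -> (batch * num_choices, seq_len)."""
    return [seq for example in batch for seq in example]


def unflatten_scores(scores, num_choices):
    """(batch * num_choices,) -> (batch, num_choices)."""
    return [scores[i:i + num_choices] for i in range(0, len(scores), num_choices)]


def toy_score(seq):
    """Hypothetical scorer standing in for the model; returns one number per sequence."""
    return float(sum(seq))


batch = [[[2, 5, 3], [2, 9, 3]]]  # 1 example, 2 choices, made-up token ids
flat = flatten_choices(batch)
logits = unflatten_scores([toy_score(seq) for seq in flat], num_choices=2)
predicted_choice = [max(range(len(row)), key=row.__getitem__) for row in logits]
print(logits, predicted_choice)
```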


AlbertForTokenClassification

In [43]:
from transformers import AutoTokenizer, AlbertForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("albert/albert-base-v2")
model = AlbertForTokenClassification.from_pretrained("albert/albert-base-v2")

inputs = tokenizer(
    "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_token_class_ids = logits.argmax(-1)

# Note that tokens are classified rather than input words, which means that
# there might be more predicted token classes than words.
# Multiple token classes might account for the same word.
predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]]

labels = predicted_token_class_ids
loss = model(**inputs, labels=labels).loss

Some weights of AlbertForTokenClassification were not initialized from the model checkpoint at albert/albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
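
Because predictions are made per token, a word split into several pieces receives several labels. One simple convention, sketched below, is to keep the label of each word's first sub-token. The `word_ids` list and the labels here are hand-written for illustration (a fast tokenizer's encoding can provide such a mapping via `word_ids()`, with `None` for special tokens); the helper name is made up.

```python
from typing import List, Optional


def first_subtoken_labels(token_labels: List[str], word_ids: List[Optional[int]]) -> List[str]:
    """Collapse per-token labels to per-word labels using each word's first sub-token."""
    word_labels = {}
    for label, word_id in zip(token_labels, word_ids):
        if word_id is None:  # special tokens belong to no word
            continue
        word_labels.setdefault(word_id, label)  # keep only the first label seen per word
    return [word_labels[w] for w in sorted(word_labels)]


# Made-up example: word 0 is split into three pieces, the others into one each.
word_ids = [None, 0, 0, 0, 1, 2, 3, None]
token_labels = ["LABEL_0", "LABEL_1", "LABEL_0", "LABEL_0", "LABEL_0", "LABEL_0", "LABEL_1", "LABEL_0"]
print(first_subtoken_labels(token_labels, word_ids))
```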


AlbertForQuestionAnswering

In [44]:
from transformers import AutoTokenizer, AlbertForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("twmkn9/albert-base-v2-squad2")
model = AlbertForQuestionAnswering.from_pretrained("twmkn9/albert-base-v2-squad2")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

inputs = tokenizer(question, text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)

# target is "nice puppet"
target_start_index = torch.tensor([12])
target_end_index = torch.tensor([13])

outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
loss = outputs.loss
round(loss.item(), 2)

Some weights of the model checkpoint at twmkn9/albert-base-v2-squad2 were not used when initializing AlbertForQuestionAnswering: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


7.36

In [45]:
tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)


'a nice puppet'
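
Taking the argmax of the start and end logits independently, as in the cell above, can occasionally yield an end index that precedes the start index. A common alternative, sketched here with made-up logits, is to search for the highest-scoring pair with start <= end and a bounded length. The function name and the `max_answer_len` parameter are assumptions for this illustration.

```python
from typing import List, Optional, Tuple


def best_span(
    start_logits: List[float], end_logits: List[float], max_answer_len: int = 30
) -> Tuple[Optional[Tuple[int, int]], float]:
    """Return the (start, end) pair maximizing start_logit + end_logit with start <= end."""
    best, best_score = None, float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score


# Made-up logits where independent argmax would give start=2, end=0 (an invalid span).
start_logits = [0.1, 0.2, 3.0, 0.5]
end_logits = [2.5, 0.1, 0.3, 2.0]
span, score = best_span(start_logits, end_logits)
print(span, score)
```

The returned indices can then be used to slice `input_ids` exactly as `answer_start_index` and `answer_end_index` are used above.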