<a href="https://colab.research.google.com/github/qmeng222/transformers-for-NLP/blob/main/Models_%26_Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [59]:
!pip install transformers # install the Hugging Face Transformers library



In [60]:
from transformers import AutoTokenizer # import `AutoTokenizer` class to automatically load the appropriate tokenizer for a specific pre-trained model

In [61]:
checkpoint = "bert-base-uncased" # specify the pre-trained model name
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # load the appropriate tokenizer for the specified pre-trained model

In [62]:
tokenizer # the tokenizer object

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [63]:
# call the tokenizer object as a function with an input text:
tokenizer("hello world")

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

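👆The result is a dictionary-like object with three fields: `input_ids` (the token ids, including the special `[CLS]` and `[SEP]` tokens), `token_type_ids` (segment ids), and `attention_mask`. That is more than just a list of tokens.

If you're preparing data for model input, calling the `tokenizer` object directly is usually the more appropriate choice.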

# Methods for the tokenizer object:

In [64]:
# tokenize an input text using the `tokenize` method of the `tokenizer`:
tokens = tokenizer.tokenize("hello world")
tokens

['hello', 'world']

👆The result is a list of token strings. Each element is one WordPiece token (a whole word or a sub-word piece), and no special tokens such as `[CLS]` or `[SEP]` are added.

If you specifically need a list of tokens, then `tokenizer.tokenize` is suitable.
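To give a feel for what a "token" is here, the following is a simplified sketch of the greedy longest-match-first splitting that WordPiece tokenizers such as BERT's use. `TOY_VOCAB`, `wordpiece`, and `toy_tokenize` are hypothetical names invented for this illustration, and the vocabulary is made up; the real tokenizer also handles punctuation and normalization, which this sketch ignores.

```python
# Simplified sketch of WordPiece's greedy longest-match-first splitting.
# TOY_VOCAB is a hypothetical vocabulary invented for illustration; it is NOT BERT's.
TOY_VOCAB = {"hello", "world", "token", "##ize", "##r", "##s", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text):
    """Lowercase, split on whitespace, then apply WordPiece to each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    return tokens


print(toy_tokenize("hello world"))  # both words are in the toy vocabulary
print(toy_tokenize("Tokenizers"))   # split into a word start plus "##" continuation pieces
print(toy_tokenize("cats"))         # not coverable by the toy vocabulary
```

Words that are in the vocabulary stay whole, while other words are broken into a starting piece followed by `##`-prefixed continuation pieces, or replaced by `[UNK]` when no pieces fit.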

In [65]:
# take a list of tokens & convert each token into its corresponding integer identifier (id):
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

[7592, 2088]

In [66]:
# take a list of integer identifiers (ids) & convert them into a list of tokens:
tokenizer.convert_ids_to_tokens(ids)

['hello', 'world']

👆`convert_ids_to_tokens` is useful when you want to work with the individual tokens rather than a full decoded string.

# encode / decode:

In [67]:
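# tokenize a text & convert it to ids in one step, adding the special tokens [CLS] (101) & [SEP] (102):
ids = tokenizer.encode("hello world")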
ids

[101, 7592, 2088, 102]

In [68]:
# take a list of integer identifiers (ids) & convert them into a single string:
tokenizer.decode(ids)

'[CLS] hello world [SEP]'

👆`decode` is useful for converting integer ids, such as those produced by a text-generation model, back into human-readable text.
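As a rough mental model, `encode` behaves like `tokenize` followed by `convert_tokens_to_ids`, with `[CLS]` and `[SEP]` wrapped around the result, and `decode` reverses the mapping. The sketch below mimics this with a toy vocabulary built only from ids that appear in this notebook's outputs. `toy_encode` and `toy_decode` are hypothetical helpers, not the library's implementation; the `skip_special_tokens` flag mirrors the parameter of the same name on the real `decode`.

```python
# Toy vocabulary built only from ids that appear in this notebook's outputs.
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}
ID_TO_TOKEN = {idx: token for token, idx in VOCAB.items()}
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]"}


def toy_encode(text):
    """Whitespace-tokenize, map tokens to ids, and wrap with [CLS] ... [SEP]."""
    tokens = text.lower().split()
    ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]


def toy_decode(ids, skip_special_tokens=False):
    """Map ids back to tokens and join them into a single string."""
    tokens = [ID_TO_TOKEN[idx] for idx in ids]
    if skip_special_tokens:
        tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
    return " ".join(tokens)


ids = toy_encode("hello world")
print(ids)
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

With the real tokenizer, `tokenizer.decode(ids, skip_special_tokens=True)` drops `[CLS]` and `[SEP]` in the same way.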

In [69]:
# call the tokenizer object as a function:
model_inputs = tokenizer("hello world")
model_inputs

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

# Multiple text inputs as a list:

In [70]:
data = [
  "I like cats.",
  "Do you like cats too?",
]
tokenizer(data)

{'input_ids': [[101, 1045, 2066, 8870, 1012, 102], [101, 2079, 2017, 2066, 8870, 2205, 1029, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

# Classify a sequence of text into one or more predefined categories:

In [71]:
from transformers import AutoModelForSequenceClassification # import the class for sequence classification tasks

In [72]:
# create an instance of the specified pre-trained model for sequence classification task:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [73]:
model_inputs = tokenizer("hello world", return_tensors='pt') # return the results as PyTorch tensors (with a leading batch dimension)
model_inputs

{'input_ids': tensor([[ 101, 7592, 2088,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

In [74]:
# by default, `num_labels` is 2, so the model gets a binary classification head (two logits):
outputs = model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.6284, -0.0080]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

# Create another model, switching from binary to multi-class classification:

In [75]:
# create a model for a classification task with 3 distinct labels:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [76]:
outputs = model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2720, -0.0783, -0.1563]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [77]:
# raw scores or logits produced by the model before applying a final activation function (e.g., softmax in classification tasks):
outputs.logits # treat `logits` as an attribute

tensor([[ 0.2720, -0.0783, -0.1563]], grad_fn=<AddmmBackward0>)

In [78]:
outputs['logits'] # treat the outputs as a dictionary

tensor([[ 0.2720, -0.0783, -0.1563]], grad_fn=<AddmmBackward0>)

In [79]:
outputs[0] # treat the outputs as a tuple (fields that are None are skipped)

tensor([[ 0.2720, -0.0783, -0.1563]], grad_fn=<AddmmBackward0>)

In [80]:
len(outputs) # only `logits` is set; `loss`, `hidden_states` & `attentions` are None and are not counted

1

In [81]:
# tensor -> np array:
outputs.logits.detach().cpu().numpy()
# .detach(): PyTorch method to create a new tensor that shares the same storage as the original tensor but with the computation history detached
# .cpu(): move the tensor to CPU memory (a no-op if it is already there), since NumPy cannot operate on GPU tensors
# .numpy(): convert the PyTorch tensor to a NumPy array

array([[ 0.27197528, -0.07832061, -0.15634893]], dtype=float32)

In [87]:
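# pad to the longest sequence in the batch & truncate anything longer than the model's maximum length:
model_inputs = tokenizer(data, padding=True, truncation=True, return_tensors='pt')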
model_inputs

{'input_ids': tensor([[ 101, 1045, 2066, 8870, 1012,  102,    0,    0],
        [ 101, 2079, 2017, 2066, 8870, 2205, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}

In [83]:
model_inputs.input_ids

tensor([[ 101, 1045, 2066, 8870, 1012,  102,    0,    0],
        [ 101, 2079, 2017, 2066, 8870, 2205, 1029,  102]])

In [84]:
model_inputs['input_ids']

tensor([[ 101, 1045, 2066, 8870, 1012,  102,    0,    0],
        [ 101, 2079, 2017, 2066, 8870, 2205, 1029,  102]])

👆Great, both sequences of ids now have the same length as a result of padding.

In [85]:
model_inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])

👆 1 for real tokens, and 0 for padding tokens.
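To make the padding and the mask concrete, here is a minimal pure-Python sketch that reproduces the `input_ids` and `attention_mask` shown above from the unpadded ids. `pad_batch` is a hypothetical helper, not part of the Transformers API. It assumes right-side padding with pad id 0, which matches `padding_side='right'` and `[PAD]` = 0 in the tokenizer printout earlier.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # fill with the pad id
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded ids copied from the earlier `tokenizer(data)` output.
batch = [
    [101, 1045, 2066, 8870, 1012, 102],
    [101, 2079, 2017, 2066, 8870, 2205, 1029, 102],
]

padded = pad_batch(batch)
for row_ids, row_mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row_ids, row_mask)
```

Equal-length rows are what allow the batch to be stacked into a single rectangular tensor, and the mask tells the model which positions to ignore.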

# Pass tokenized data through the model:

In [86]:
outputs = model(**model_inputs)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4752, -0.1229, -0.1013],
        [ 0.4151, -0.1148, -0.1456]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

👆 And we get numerical predictions: one row of 3 logits per input text!

numerical predictions (logits) -- (post-processing) --> human-readable predictions
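One common form of post-processing is to turn each row of logits into probabilities with softmax and pick the highest-scoring class. The sketch below does this in plain Python on the logits printed above. The `LABELS` names are hypothetical, since the notebook never defines what the 3 classes mean, and because the classification head is newly initialized and untrained, the resulting "predictions" carry no meaning yet.

```python
import math

# Hypothetical label names: the notebook does not define what the 3 classes are.
LABELS = ["negative", "neutral", "positive"]


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above (one row per input text).
logits = [
    [0.4752, -0.1229, -0.1013],
    [0.4151, -0.1148, -0.1456],
]

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    predictions.append(LABELS[best])
    print(LABELS[best], [round(p, 3) for p in probs])
```

On the actual tensors, `torch.softmax(outputs.logits, dim=-1)` followed by `.argmax(dim=-1)` performs the same two steps.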