<a href="https://colab.research.google.com/github/sahug/ds-bert/blob/main/BERT%20NLP%20-%20Tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT NLP - Tokenizers**

A tokenizer is in charge of preparing the inputs for a model. The Hugging Face Transformers library contains tokenizers for all of its models. Most tokenizers are available in two flavors: a full Python implementation and a “Fast” implementation backed by the Rust-based Tokenizers library.

**Install Transformers**

In [1]:
%pip install -qq transformers

**Import Tokenizer**

In [2]:
from transformers import BertTokenizer

**Tokenizer**


In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Apart from the tokens for each word, the tokenizer also adds two special tokens to every sequence: `[CLS]` (id **101**) at the **start** and `[SEP]` (id **102**) at the **end**.

**Returns**
- **input_ids:** The input ids are often the only required parameter to be passed to the model as input. They are token indices: numerical representations of the tokens that make up the input sequences.

- **attention_mask:** Indicates to the model which tokens should be attended to and which should not, so that attention is not performed on padding token indices. Mask values are either 0 or 1: *1 for tokens that are not masked, 0 for tokens that are masked (padding).*

- **token_type_ids:** Some tasks, such as sentence-pair classification or question answering, require two different sequences to be joined in a single “input_ids” entry, which is usually done with the help of special tokens such as the classifier (`[CLS]`) and separator (`[SEP]`) tokens. For example, the BERT model builds its two-sequence input as `[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`. The token type ids mark which sequence each token belongs to: 0 for the first sequence, 1 for the second.
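
The cells in this notebook only ever pass single sentences, so `token_type_ids` is always all zeros. As a minimal pure-Python sketch of the pair layout described above (not the library's implementation), the hypothetical helper `build_pair_inputs` packs two id lists by hand. The ids are the ones `bert-base-uncased` produces for these sentences in this notebook's outputs; the helper name and its structure are assumptions made for illustration.

```python
CLS_ID = 101  # [CLS], as seen in the outputs of this notebook
SEP_ID = 102  # [SEP]


def build_pair_inputs(ids_a, ids_b):
    """Pack two token-id lists as [CLS] A [SEP] B [SEP] (hypothetical helper)."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    input_ids = first + second
    # Segment 0 covers [CLS], sequence A and its [SEP];
    # segment 1 covers sequence B and its [SEP].
    token_type_ids = [0] * len(first) + [1] * len(second)
    # Nothing is padded here, so every position is attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Token ids without special tokens, copied from this notebook's outputs.
sentence_a = [2023, 2003, 1037, 6251, 1012]  # "this is a sentence."
sentence_b = [2023, 2003, 2460, 1012]        # "this is short."

pair = build_pair_inputs(sentence_a, sentence_b)
print(pair["input_ids"])
print(pair["token_type_ids"])
```

With the real tokenizer, a pair can be passed as two arguments, for example `tokenizer("This is a sentence.", "This is short.")`, and the returned `token_type_ids` should follow the same 0/1 pattern.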

In [4]:
texts = ["This is a sentence.", "Here is another sentence. This is a little longer.", "This is short."]
tokenizer_ = tokenizer(texts)
print(tokenizer_)
print("input_ids: ", tokenizer_.input_ids)
print("token_type_ids: ", tokenizer_.token_type_ids)
print("attention_mask: ", tokenizer_.attention_mask)
print("decode: ", tokenizer.decode(tokenizer_["input_ids"][0]))
print("decode: ", tokenizer.decode(tokenizer_["input_ids"][1]))
print("decode: ", tokenizer.decode(tokenizer_["input_ids"][2]))
print("decode: ", tokenizer.decode(tokenizer_["input_ids"][0]))

{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
input_ids:  [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]]
token_type_ids:  [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
attention_mask:  [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]
decode:  [CLS] this is a sentence. [SEP]
decode:  [CLS] here is another sentence. this is a little longer. [SEP]
decode:  [CLS] this is short. [SEP]
decode:  [CLS] this is a sentence. [SEP]


**Tokenizer Functions**
- **tokenize** - Converts a string into a sequence of tokens. Returns a list of tokens.
- **convert_tokens_to_ids** - Returns the vocabulary ids of the tokens.
- **convert_ids_to_tokens** - Converts token ids back to tokens (words or sub-word pieces).
- **convert_tokens_to_string** - Joins the tokens back into a single string.

In [5]:
def tokenize_(text):
  tokens = tokenizer.tokenize(text)
  token_ids = tokenizer.convert_tokens_to_ids(tokens)
  words = tokenizer.convert_ids_to_tokens(token_ids)
  string = tokenizer.convert_tokens_to_string(tokens)
  print("tokenize: ", tokens, "\t", "Length", len(tokens))
  print("convert_tokens_to_ids: ", token_ids, "\t", "Length", len(token_ids))
  print("convert_ids_to_tokens: ", words, "\t", "Length", len(words))
  print("convert_tokens_to_string: ", string)

**Example 1:** Plain Text

In [6]:
text = "this is a simple text example"
tokenize_(text)

tokenize:  ['this', 'is', 'a', 'simple', 'text', 'example'] 	 Length 6
convert_tokens_to_ids:  [2023, 2003, 1037, 3722, 3793, 2742] 	 Length 6
convert_ids_to_tokens:  ['this', 'is', 'a', 'simple', 'text', 'example'] 	 Length 6
convert_tokens_to_string:  this is a simple text example


**Example 2:** Comma Separated

In [7]:
text = "this is a simple text example, this is alos an example"
tokenize_(text)

tokenize:  ['this', 'is', 'a', 'simple', 'text', 'example', ',', 'this', 'is', 'al', '##os', 'an', 'example'] 	 Length 13
convert_tokens_to_ids:  [2023, 2003, 1037, 3722, 3793, 2742, 1010, 2023, 2003, 2632, 2891, 2019, 2742] 	 Length 13
convert_ids_to_tokens:  ['this', 'is', 'a', 'simple', 'text', 'example', ',', 'this', 'is', 'al', '##os', 'an', 'example'] 	 Length 13
convert_tokens_to_string:  this is a simple text example , this is alos an example


**Example 3:** Special Characters

In [8]:
text = "this is a simple text example, (this is alos an example)"
tokenize_(text)

tokenize:  ['this', 'is', 'a', 'simple', 'text', 'example', ',', '(', 'this', 'is', 'al', '##os', 'an', 'example', ')'] 	 Length 15
convert_tokens_to_ids:  [2023, 2003, 1037, 3722, 3793, 2742, 1010, 1006, 2023, 2003, 2632, 2891, 2019, 2742, 1007] 	 Length 15
convert_ids_to_tokens:  ['this', 'is', 'a', 'simple', 'text', 'example', ',', '(', 'this', 'is', 'al', '##os', 'an', 'example', ')'] 	 Length 15
convert_tokens_to_string:  this is a simple text example , ( this is alos an example )


**Example 4:** Two Sentences.

In [9]:
text = "this is a simple text example. (This is alos an example)"
tokenize_(text)

tokenize:  ['this', 'is', 'a', 'simple', 'text', 'example', '.', '(', 'this', 'is', 'al', '##os', 'an', 'example', ')'] 	 Length 15
convert_tokens_to_ids:  [2023, 2003, 1037, 3722, 3793, 2742, 1012, 1006, 2023, 2003, 2632, 2891, 2019, 2742, 1007] 	 Length 15
convert_ids_to_tokens:  ['this', 'is', 'a', 'simple', 'text', 'example', '.', '(', 'this', 'is', 'al', '##os', 'an', 'example', ')'] 	 Length 15
convert_tokens_to_string:  this is a simple text example . ( this is alos an example )
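
The misspelled word `alos` is not in the vocabulary, which is why it shows up as the two pieces `al` and `##os` in the outputs above. BERT uses WordPiece, which splits a word greedily into the longest pieces it can find in the vocabulary. The following is a minimal sketch of that idea, assuming a hypothetical toy vocabulary; `wordpiece` and `tokens_to_string` are illustrative helpers, not the library's functions.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokens_to_string(tokens):
    """Join tokens with spaces, then glue ## pieces onto the previous token."""
    return " ".join(tokens).replace(" ##", "").strip()


# Hypothetical toy vocabulary; the real bert-base-uncased vocabulary
# has roughly 30,000 entries.
toy_vocab = {"this", "is", "al", "##os", "an", "example", ","}

print(wordpiece("alos", toy_vocab))
print(tokens_to_string(["this", "is", "al", "##os", "an", "example"]))
```

Joining with spaces also explains why punctuation comes back as `example , this` in `convert_tokens_to_string`: every token that is not a `##` continuation is separated by a space.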


**encode**

Converts a string to a sequence of integer ids, using the tokenizer and vocabulary. It is the same as `self.convert_tokens_to_ids(self.tokenize(text))`, except that by default it also adds the special tokens `[CLS]` and `[SEP]`, as the outputs below show.

While this function is useful, it has a limitation: it processes only one string at a time. In other words, it does not support batches, so encoding several sentences requires a for loop. By contrast, as shown above, a list can be passed directly to **tokenizer** without any loop.

**max_length, padding, truncation:** Optional Parameters

In [10]:
sentences = ["This is a sentence.", "Here is another sentence. This is a little longer.", "This is short."]
for sentence in sentences:
  print(tokenizer.encode(sentence))
  print(tokenizer.encode(sentence, max_length=12, padding="max_length", truncation=True))

[101, 2023, 2003, 1037, 6251, 1012, 102]
[101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0]
[101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102]
[101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 102]
[101, 2023, 2003, 2460, 1012, 102]
[101, 2023, 2003, 2460, 1012, 102, 0, 0, 0, 0, 0, 0]
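
The outputs above suggest how `max_length=12` is applied: the content tokens are truncated first so that `[CLS]` and `[SEP]` still fit, and shorter sequences are padded on the right with id 0. Below is a minimal sketch of that behaviour under those assumptions; `encode_fixed_length` is a hypothetical helper, and the real tokenizer supports more strategies than this.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids visible in the outputs above


def encode_fixed_length(content_ids, max_length):
    """Add [CLS]/[SEP], truncating the content first and padding on the right."""
    room = max_length - 2  # two slots are reserved for [CLS] and [SEP]
    ids = [CLS_ID] + content_ids[:room] + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


# Content ids (without special tokens) copied from the outputs above.
short = [2023, 2003, 1037, 6251, 1012]
longer = [2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012]

print(encode_fixed_length(short, 12))
print(encode_fixed_length(longer, 12))
```

Note that the final period (id 1012) of the longer sentence is dropped while `[SEP]` (id 102) is kept, which matches the truncated output of the cell above.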


In [11]:
def encode_(text):
  tokens = tokenizer.encode(text)
  tokens_op = tokenizer.encode(text, max_length=12, padding="max_length", truncation=True)
  print(tokens)
  print(tokens_op)

**Example 1:** Plain Text

In [12]:
text = "this is a simple text example"
encode_(text)

[101, 2023, 2003, 1037, 3722, 3793, 2742, 102]
[101, 2023, 2003, 1037, 3722, 3793, 2742, 102, 0, 0, 0, 0]


**Example 2:** Comma Separated

In [13]:
text = "this is a simple text example, this is alos an example"
encode_(text)

[101, 2023, 2003, 1037, 3722, 3793, 2742, 1010, 2023, 2003, 2632, 2891, 2019, 2742, 102]
[101, 2023, 2003, 1037, 3722, 3793, 2742, 1010, 2023, 2003, 2632, 102]


**Example 3:** Special Characters

In [14]:
text = "this is a simple text example, (this is alos an example)"
encode_(text)

[101, 2023, 2003, 1037, 3722, 3793, 2742, 1010, 1006, 2023, 2003, 2632, 2891, 2019, 2742, 1007, 102]
[101, 2023, 2003, 1037, 3722, 3793, 2742, 1010, 1006, 2023, 2003, 102]


**Example 4:** Two Sentences.

In [15]:
text = "this is a simple text example. (This is alos an example)"
encode_(text)

[101, 2023, 2003, 1037, 3722, 3793, 2742, 1012, 1006, 2023, 2003, 2632, 2891, 2019, 2742, 1007, 102]
[101, 2023, 2003, 1037, 3722, 3793, 2742, 1012, 1006, 2023, 2003, 102]


**encode_plus**

`tokenizer.encode_plus()` is actually quite similar to the regular encode function, except that it returns a dictionary that includes all the keys that we’ve discussed above: **input_ids**, **token_type_ids**, and **attention_mask**.

Much like `tokenizer.encode()`, it accepts the same optional arguments: **max_length, padding, and truncation**.

In [16]:
sentences = ["This is a sentence.", "Here is another sentence. This is a little longer.", "This is short."]
for sentence in sentences:
    print(tokenizer.encode_plus(sentence))
    print(tokenizer.encode_plus(sentence, max_length=12, padding="max_length", truncation=True))

{'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]}
{'input_ids': [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 2460, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2003, 2460, 1012, 102, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1,

**batch_encode_plus**

The encoding functions we have looked at so far all expect a single string as input. Normally, though, the input comes in batches, and we don’t want to write a for loop that encodes each item and appends it to a result list. `tokenizer.batch_encode_plus()`, as the name implies, handles batch inputs.
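
Sequences in a batch usually need the same length before they can be stacked into a tensor, which is why batching and padding tend to go together. As a rough pure-Python sketch, assuming padding to the longest sequence in the batch, the hypothetical helper `pad_batch` below pads each id list with 0 and builds the matching attention mask. The ids are copied from the earlier outputs in this notebook.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the longest one and build the attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Already-encoded sentences, copied from the earlier outputs of this notebook.
batch = [
    [101, 2023, 2003, 1037, 6251, 1012, 102],
    [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102],
    [101, 2023, 2003, 2460, 1012, 102],
]

padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])
print(padded["attention_mask"][2])
```

With the real tokenizer, passing `padding=True` asks for this kind of pad-to-longest behaviour, while `padding="max_length"` pads to a fixed length instead.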

In [17]:
sentences = ["This is a sentence.", "Here is another sentence. This is a little longer.", "This is short."]
# Pass the whole list `sentences`, not the leftover loop variable `sentence`;
# a single string would be split into characters and each one encoded separately.
print(tokenizer.batch_encode_plus(sentences))
print(tokenizer.batch_encode_plus(sentences, max_length=12, padding="max_length", truncation=True))

{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 1012, 102], [101, 2023, 2003, 2460, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}
{'input_ids': [[101, 2023, 2003, 1037, 6251, 1012, 102, 0, 0, 0, 0, 0], [101, 2182, 2003, 2178, 6251, 1012, 2023, 2003, 1037, 2210, 2936, 102], [101, 2023, 2003, 2460, 1012, 102, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]}

**Special Tokens**

For our experiment, we need to know what BERT’s special tokens are. Specifically, we have to know what the mask token looks like in order to run a basic masked language modeling task.

In [18]:
tokenizer.special_tokens_map

{'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'pad_token': '[PAD]',
 'sep_token': '[SEP]',
 'unk_token': '[UNK]'}
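
As a small illustration of how the mask token might be used, the sketch below replaces one word of a sentence with the `mask_token` string from the map above. `mask_word` is a hypothetical helper that splits on whitespace only; it is an assumption made for this example, not part of the library.

```python
# The special tokens map printed above, copied as a plain dictionary.
special_tokens = {
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
    "pad_token": "[PAD]",
    "sep_token": "[SEP]",
    "unk_token": "[UNK]",
}


def mask_word(text, index):
    """Replace the whitespace-separated word at `index` with the mask token."""
    words = text.split()
    words[index] = special_tokens["mask_token"]
    return " ".join(words)


masked = mask_word("this is a simple text example", 3)
print(masked)  # this is a [MASK] text example
```

The masked string could then be passed to the tokenizer like any other text, so that a masked language model can be asked to predict the hidden word.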

**TensorFlow**

We can also load the tokenizer, or preprocessor, from TensorFlow Hub. It is also available as a KerasLayer. It takes a list of strings as input.

**Returns**

The preprocessor returns a dictionary with three important items:

- **input_word_ids:** The indices corresponding to each token in the sentence. Tensor of shape [batch_size, seq_length] with the token ids of the packed input sequence (that is, including a *start-of-sequence token, end-of-segment tokens, and padding*).

- **input_mask:** Indicates whether a token should be attended to or not. Tensor of shape [batch_size, seq_length] with value **1** at the position of all input tokens present before padding and value **0** for the padding tokens.

- **input_type_ids:** Identifies which sequence a token belongs to when there is more than one sequence. Tensor of shape [batch_size, seq_length] with the index of the input segment that gave rise to the input token at the respective position. The *first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second and later segments (if present) include their respective end-of-segment token. Padding tokens get index 0 again.*
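
To make the three fixed-length outputs concrete, here is a minimal pure-Python sketch for a single segment. It works on plain lists rather than tensors, and `pack_single_segment` is a hypothetical helper, not the preprocessing model itself. The sequence length of 128 matches the output shapes shown in this notebook, and the ids are the ones the Hugging Face tokenizer printed earlier for the same sentence.

```python
SEQ_LENGTH = 128  # sequence length visible in the output shapes of this notebook


def pack_single_segment(token_ids, seq_length=SEQ_LENGTH):
    """Pad one already-encoded segment to a fixed length (hypothetical helper)."""
    if len(token_ids) > seq_length:
        raise ValueError("this sketch does not handle truncation")
    padding = seq_length - len(token_ids)
    return {
        "input_word_ids": token_ids + [0] * padding,
        "input_mask": [1] * len(token_ids) + [0] * padding,
        # A single segment has index 0 everywhere, padding included.
        "input_type_ids": [0] * seq_length,
    }


# "this is a simple text example" with [CLS] and [SEP] already added.
packed = pack_single_segment([101, 2023, 2003, 1037, 3722, 3793, 2742, 102])
print(sum(packed["input_mask"]))     # 8 real tokens
print(len(packed["input_word_ids"]))  # 128
```

The eight ones in `input_mask` correspond to the six word tokens plus the start-of-sequence and end-of-segment tokens.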

In [19]:
%pip install -qq tensorflow_hub
%pip install -qq tensorflow_text

In [20]:
import tensorflow_hub as hub
import tensorflow_text  # imported only for its side effect: it registers the ops the preprocessing model needs

In [21]:
preprocess = hub.load('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/1')

**Example 1:** Plain Text

In [23]:
text = "this is a simple text example"
preprocess([text])

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 

**Example 2:** Comma Separated

In [24]:
text = "this is a simple text example, this is alos an example"
preprocess([text])

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 

**Example 3:** Special Characters

In [25]:
text = "this is a simple text example, (this is alos an example)"
preprocess([text])

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 

**Example 4:** Two Sentences.

In [26]:
text = "this is a simple text example. (This is alos an example)"
preprocess([text])

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_type_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 

In [28]:
text = ["This is a sentence.", "Here is another sentence. This is a little longer.", "This is short."]
preprocess(text)

{'input_mask': <tf.Tensor: shape=(3, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,