A tokenizer splits text into tokens and converts them into the input ids and attention masks that a model consumes.
How the text is split depends on the tokenizer. Tokenization can happen at different levels:
- bytes
- characters
- subwords/sequences
- words
- sentences
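
To make the levels above concrete, here is a minimal sketch that splits one string at the byte, character, word, and sentence level using only the Python standard library. The sample text and the regular expression used for sentence splitting are illustrative assumptions, not what any Hugging Face tokenizer does internally. Subword tokenization is left out because it needs a learned vocabulary.

```python
import re

# Illustrative sample text (an assumption, not taken from any dataset).
text = "Tokenizers split text. Models read ids!"

# Byte level: every UTF-8 byte becomes one token (an integer 0-255).
byte_tokens = list(text.encode("utf-8"))

# Character level: every Unicode character becomes one token.
char_tokens = list(text)

# Word level: naive whitespace split; punctuation stays attached to words.
word_tokens = text.split()

# Sentence level: naive split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(len(byte_tokens), len(char_tokens))
print(word_tokens)
print(sentence_tokens)
```

The finer the level, the smaller the vocabulary and the longer the resulting sequence; subword tokenizers sit between the character and word levels.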

### Generate tokens with Hugging Face (HF)

In [34]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

raw_inputs = [
    "This sentense needs to be broken into pieces."
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

for k, v in inputs.items():
    print(k, v)

# Decode the ids back to text; the tokenizer has added the [CLS] and [SEP] special tokens.
tokenizer.decode(inputs["input_ids"][0])

input_ids tensor([[  101,  1188,  1850, 22615,  2993,  1106,  1129,  3088,  1154,  3423,
           119,   102]])
token_type_ids tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
attention_mask tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


'[CLS] This sentense needs to be broken into pieces. [SEP]'
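
The 12 ids above cover eight words, a period, and the two special tokens, so one word must have been split into two pieces; this is presumably the misspelled "sentense". BERT tokenizers use WordPiece, which splits a word that is not in the vocabulary into known subword pieces. The sketch below is a toy greedy longest-match splitter in that spirit. The function name `wordpiece` and the tiny `toy_vocab` are hypothetical and only for illustration; the real vocabulary and the real pieces may differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real bert-base-cased vocabulary.
toy_vocab = {"This", "needs", "sent", "##ense", "sentence", "pieces"}

known = wordpiece("sentence", toy_vocab)
misspelled = wordpiece("sentense", toy_vocab)
unknown = wordpiece("xyz", toy_vocab)
print(known, misspelled, unknown)
```

With the real tokenizer, `tokenizer.tokenize(...)` returns the pieces as strings, which is a quick way to see how a given word was split.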

In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")

raw_inputs = [
    "This sentense needs to be broken into pieces."
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

for k, v in inputs.items():
    print(k, v[0])

# Decode the ids back to text; ALBERT lowercases the input and uses its own special-token ids.
tokenizer.decode(inputs["input_ids"][0])

input_ids tensor([   2,   48,  795, 6498, 2274,   20,   44, 2023,   77, 2491,    9,    3])
token_type_ids tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
attention_mask tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


'[CLS] this sentense needs to be broken into pieces.[SEP]'

### Batches of inputs

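Models usually expect a batch of sequences rather than a single one. A single sequence therefore has to be wrapped in an extra (batch) dimension before it is passed to the model.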

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "This is a very long sentence but it does not look like it !"

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

# Wrap the single sequence in a list so the tensor has shape (batch_size, sequence_length)
input_ids = torch.tensor([ids])
print(input_ids)

mout = model(input_ids)
print(mout.logits)

tensor([[2023, 2003, 1037, 2200, 2146, 6251, 2021, 2009, 2515, 2025, 2298, 2066,
         2009,  999]])
tensor([[ 3.7695, -3.1442]], grad_fn=<AddmmBackward0>)
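
The logits above are raw, unnormalized scores. As a minimal sketch, a softmax turns them into probabilities. The logit values are copied from the output above; the label order (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption about this checkpoint that should be checked against `model.config.id2label`.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.7695, -3.1442]  # copied from the model output above
probs = softmax(logits)

# Assumed label order; verify with model.config.id2label.
labels = ["NEGATIVE", "POSITIVE"]
for label, p in zip(labels, probs):
    print(label, round(p, 4))
```

Note that the ids in this cell were built with `tokenize` and `convert_tokens_to_ids`, so they do not include the special tokens that calling the tokenizer directly would add.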


#### Padding / Attention mask

In [38]:
input_sequences = ["This is a fountain pen.", 
                   "This is a pen."]

print(input_sequences)

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

tokens = [tokenizer.tokenize(i) for i in input_sequences]
ids = [tokenizer.convert_tokens_to_ids(i) for i in tokens]

# Show the ids before padding; the second sequence is one token shorter than the first.
print("\n Ids without padding, (one token difference is observed.)")

for k in ids:
    print(k)

ids[1].append(tokenizer.pad_token_id)

print("\n After adding padding token to the list.")

for k in ids:
    print(k)

input_ids = torch.tensor(ids)

mout = model(input_ids)
print("\nLogits output by the model.")
print(mout.logits)

# Without a mask, the padding token is treated as part of the sentence and influences
# the logits. An attention mask (1 = real token, 0 = padding) tells the model to ignore it.
attn_mask = [[1, 1, 1, 1, 1, 1],
             [1, 1, 1, 1, 1, 0]]

mout = model(input_ids, attention_mask=torch.tensor(attn_mask))
print("\nLogits output by the model. (after attn mask)")
print(mout.logits)


['This is a fountain pen.', 'This is a pen.']

 Ids without padding, (one token difference is observed.)
[2023, 2003, 1037, 9545, 7279, 1012]
[2023, 2003, 1037, 7279, 1012]

 After adding padding token to the list.
[2023, 2003, 1037, 9545, 7279, 1012]
[2023, 2003, 1037, 7279, 1012, 0]

Logits output by the model.
tensor([[-3.0607,  3.2588],
        [-0.1971,  0.4999]], grad_fn=<AddmmBackward0>)

Logits output by the model. (after attn mask)
tensor([[-3.0607,  3.2588],
        [ 2.3412, -2.1126]], grad_fn=<AddmmBackward0>)
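
The padding and the mask in the cell above were written by hand. A minimal sketch of how both can be derived from the unpadded ids is shown below. The helper name `pad_batch` is hypothetical, and `pad_id=0` is taken from the padded output above (the value of `tokenizer.pad_token_id` for this checkpoint).

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    padded, mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # real ids followed by padding
        mask.append([1] * len(ids) + [0] * n_pad)   # 1 = real token, 0 = padding
    return padded, mask

# Ids copied from the output above.
ids = [[2023, 2003, 1037, 9545, 7279, 1012],
       [2023, 2003, 1037, 7279, 1012]]

padded_ids, attention_mask = pad_batch(ids, pad_id=0)
print(padded_ids)
print(attention_mask)
```

In practice, calling the tokenizer on the list of sentences with `padding=True`, as in the first cell, returns both `input_ids` and `attention_mask` already padded, and also adds the special tokens.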
