# Exploring Tokenizers

## Loading the sample data

In [1]:
data = ["LTIMindtree Q2FY24: Show of strength. Good revenue growth and resilient margin performance",
        "The company expects furloughs to be more pronounced in Q3 and it is guiding to a very weak quarter, with revenue decline between 1.5 percent and 3.5 percent",
        "Arkam Ventures is also an investor in Jai Kisan, one of India’s fastest-growing rural fintech platforms for farmers and retailers, and Jumbotail, India’s leading B2B food and grocery marketplace and retail platform",
       ]

## Loading Tokenizers

A tokenizer can be loaded with the `from_pretrained()` method of any tokenizer class.

### AutoTokenizer class
`AutoTokenizer` picks the tokenizer class that matches the checkpoint name or path passed to it.

In [2]:
from transformers import AutoTokenizer

In [3]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [4]:
tokenizer(data[0])

{'input_ids': [101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### [ModelName]Tokenizer class
This loads the model-specific tokenizer from the specified checkpoint, for example `DistilBertTokenizer`.

In [5]:
from transformers import DistilBertTokenizer

In [6]:
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [7]:
distilbert_tokenizer(data[0])

{'input_ids': [101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Saving the Tokenizer

In [8]:
distilbert_tokenizer.save_pretrained('./../../../hf_models/distilbert-base-uncased/')

('./../../../hf_models/distilbert-base-uncased/tokenizer_config.json',
 './../../../hf_models/distilbert-base-uncased/special_tokens_map.json',
 './../../../hf_models/distilbert-base-uncased/vocab.txt',
 './../../../hf_models/distilbert-base-uncased/added_tokens.json')

## Tokenization Process

### Tokenization

Tokenization splits the input text into tokens drawn from the `vocab` of the checkpoint. A piece prefixed with `##` continues the token before it.

In [9]:
tokens = distilbert_tokenizer.tokenize(data[0])
print(tokens)

['lt', '##imi', '##ndt', '##ree', 'q', '##2', '##fy', '##24', ':', 'show', 'of', 'strength', '.', 'good', 'revenue', 'growth', 'and', 'res', '##ili', '##ent', 'margin', 'performance']
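The subword split shown above can be approximated with a greedy longest-match-first search, which is the usual description of WordPiece. The block below is a minimal sketch under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary is a tiny hand-made set, and the lowercasing and punctuation splitting that the real tokenizer performs beforehand are left out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the examples below.
toy_vocab = {"lt", "##imi", "##ndt", "##ree", "res", "##ili", "##ent", "show"}

print(wordpiece_tokenize("ltimindtree", toy_vocab))  # ['lt', '##imi', '##ndt', '##ree']
print(wordpiece_tokenize("resilient", toy_vocab))    # ['res', '##ili', '##ent']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

A rare name such as "LTIMindtree" ends up as four pieces because no longer match exists in the vocabulary, while a common word such as "show" stays whole.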


### Encoding
Encoding maps each token to its input ID in the vocabulary.

In [10]:
ids = distilbert_tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836]
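Conceptually, this step is a dictionary lookup. The following sketch uses a handful of token/ID pairs copied from the outputs above; `toy_ids` and `tokens_to_ids` are hypothetical names, and the ID 100 for `[UNK]` is an assumption about this checkpoint's vocabulary rather than something shown in this notebook.

```python
# Token-to-ID pairs taken from the cell outputs above; the real vocab.txt is far larger.
toy_ids = {
    "[UNK]": 100,  # assumed ID of the unknown token for this checkpoint
    "show": 2265,
    "of": 1997,
    "strength": 3997,
    ".": 1012,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look each token up in the vocabulary, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


print(tokens_to_ids(["show", "of", "strength", "."], toy_ids))  # [2265, 1997, 3997, 1012]
print(tokens_to_ids(["show", "zzz"], toy_ids))                  # [2265, 100]
```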


### Decoding
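Decoding maps input IDs back to tokens and merges subword pieces back into words. Because the checkpoint is uncased, the decoded text is lowercased and does not match the input character for character.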

In [11]:
print(distilbert_tokenizer.decode(ids))

ltimindtree q2fy24 : show of strength. good revenue growth and resilient margin performance


In [12]:
print(distilbert_tokenizer.decode(distilbert_tokenizer(data[0])['input_ids']))

[CLS] ltimindtree q2fy24 : show of strength. good revenue growth and resilient margin performance [SEP]
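The merging half of decoding can be sketched in a few lines: join the tokens with spaces, glue `##` pieces onto the token before them, and tidy the space in front of some punctuation. `merge_wordpieces` and `SPECIAL_TOKENS` are hypothetical names, and the clean-up shown handles only full stops and commas, which is a simplification of what the library does.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens, skip_special_tokens=False):
    """Rebuild text from WordPiece tokens (simplified sketch)."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Joining with spaces and then deleting " ##" attaches each continuation piece.
    text = " ".join(tokens).replace(" ##", "")
    # Simplified clean-up: drop the space before a full stop or comma.
    return text.replace(" .", ".").replace(" ,", ",")


tokens = ["lt", "##imi", "##ndt", "##ree", "q", "##2", "##fy", "##24",
          ":", "show", "of", "strength", ".", "good"]
print(merge_wordpieces(tokens))
# ltimindtree q2fy24 : show of strength. good

wrapped = ["[CLS]", "res", "##ili", "##ent", "margin", "[SEP]"]
print(merge_wordpieces(wrapped, skip_special_tokens=True))
# resilient margin
```

The `skip_special_tokens` flag mirrors the argument of the same name on the library's `decode()` method, which drops `[CLS]` and `[SEP]` from the decoded string.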


## Handling Batches

1. The model accepts input in batches, i.e. multiple sentences at once.
2. Sentences in a dataset vary in length, so they need to be padded or truncated until every sentence in the batch has the same length.
3. Every tokenizer defines special tokens such as `[CLS]` and `[SEP]`; `[PAD]` is the one used for padding.
4. *Attention masks* are tensors with exactly the same shape as the input IDs tensor, filled with 0s and 1s: 1 marks a token that should be attended to, and 0 marks a token that should be ignored (such as `[PAD]`).
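To make points 2 to 4 concrete, here is a minimal plain-Python sketch of right-padding a batch and building the matching attention mask. `pad_batch` is a hypothetical helper, not part of the library, and the pad ID of 0 is taken from the padded outputs shown in this notebook.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for the padding positions appended above.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 2265, 102], [101, 2265, 1997, 3997, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 2265, 102, 0, 0], [101, 2265, 1997, 3997, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```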

In [13]:
model_inputs = distilbert_tokenizer(data)
model_inputs

{'input_ids': [[101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], [101, 1996, 2194, 24273, 6519, 23743, 5603, 2015, 2000, 2022, 2062, 8793, 1999, 1053, 2509, 1998, 2009, 2003, 14669, 2000, 1037, 2200, 5410, 4284, 1010, 2007, 6599, 6689, 2090, 1015, 1012, 1019, 3867, 1998, 1017, 1012, 1019, 3867, 102], [101, 15745, 3286, 13252, 2003, 2036, 2019, 14316, 1999, 17410, 11382, 8791, 1010, 2028, 1997, 2634, 1521, 1055, 7915, 1011, 3652, 3541, 10346, 15007, 7248, 2005, 6617, 1998, 16629, 1010, 1998, 18414, 13344, 14162, 1010, 2634, 1521, 1055, 2877, 1038, 2475, 2497, 2833, 1998, 13025, 18086, 1998, 7027, 4132, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

##### Padding to the model's max_length

In [14]:
model_inputs = distilbert_tokenizer(data, padding="max_length")
model_inputs

{'input_ids': [[101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

##### Padding to the length of the longest sequence in the batch

In [15]:
model_inputs = distilbert_tokenizer(data, padding="longest")
model_inputs

{'input_ids': [[101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1996, 2194, 24273, 6519, 23743, 5603, 2015, 2000, 2022, 2062, 8793, 1999, 1053, 2509, 1998, 2009, 2003, 14669, 2000, 1037, 2200, 5410, 4284, 1010, 2007, 6599, 6689, 2090, 1015, 1012, 1019, 3867, 1998, 1017, 1012, 1019, 3867, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 15745, 3286, 13252, 2003, 2036, 2019, 14316, 1999, 17410, 11382, 8791, 1010, 2028, 1997, 2634, 1521, 1055, 7915, 1011, 3652, 3541, 10346, 15007, 7248, 2005, 6617, 1998, 16629, 1010, 1998, 18414, 13344, 14162, 1010, 2634, 1521, 1055, 2877, 1038, 2475, 2497, 2833, 1998, 13025, 18086, 1998, 7027, 4132, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1

##### Truncating sequences that are longer than the model's max_length

In [16]:
model_inputs = distilbert_tokenizer(data, truncation=True)
model_inputs

{'input_ids': [[101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], [101, 1996, 2194, 24273, 6519, 23743, 5603, 2015, 2000, 2022, 2062, 8793, 1999, 1053, 2509, 1998, 2009, 2003, 14669, 2000, 1037, 2200, 5410, 4284, 1010, 2007, 6599, 6689, 2090, 1015, 1012, 1019, 3867, 1998, 1017, 1012, 1019, 3867, 102], [101, 15745, 3286, 13252, 2003, 2036, 2019, 14316, 1999, 17410, 11382, 8791, 1010, 2028, 1997, 2634, 1521, 1055, 7915, 1011, 3652, 3541, 10346, 15007, 7248, 2005, 6617, 1998, 16629, 1010, 1998, 18414, 13344, 14162, 1010, 2634, 1521, 1055, 2877, 1038, 2475, 2497, 2833, 1998, 13025, 18086, 1998, 7027, 4132, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

##### Final input to the model

1. Pad to the length of the longest sequence in the batch.
2. Truncate every sentence longer than the model's max_length.
3. Return PyTorch tensors.
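The first two steps can be sketched without any library, which shows why the result is always rectangular and therefore convertible to a tensor. `prepare_batch` is a hypothetical helper; the special-token IDs are the ones visible in the outputs of this notebook, and the assumption that truncation leaves room for `[CLS]` and `[SEP]` is a simplification of the library's behaviour. The tensor conversion itself is left out to keep the sketch dependency-free.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as they appear in the outputs in this notebook


def prepare_batch(batch_token_ids, max_length):
    """Truncate, add special tokens, then pad to the longest sequence in the batch."""
    # Truncate so that the tokens plus [CLS] and [SEP] fit within max_length.
    body_limit = max_length - 2
    encoded = [[CLS_ID] + ids[:body_limit] + [SEP_ID] for ids in batch_token_ids]

    # Pad on the right to the longest sequence that remains after truncation.
    longest = max(len(ids) for ids in encoded)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short = [2265, 1997, 3997]                    # fits as-is
long = [2204, 6599, 3930, 1998, 7785, 2836]   # too long for max_length=6
result = prepare_batch([short, long], max_length=6)
print(result["input_ids"])
# [[101, 2265, 1997, 3997, 102, 0], [101, 2204, 6599, 3930, 1998, 102]]
print(result["attention_mask"])
# [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

Every row has the same length, so a call with `return_tensors='pt'` can stack the rows into a single 2-D tensor.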

In [17]:
model_inputs = distilbert_tokenizer(data, truncation=True, padding=True, return_tensors='pt')
model_inputs

{'input_ids': tensor([[  101,  8318, 27605, 26379,  9910,  1053,  2475, 12031, 18827,  1024,
          2265,  1997,  3997,  1012,  2204,  6599,  3930,  1998, 24501, 18622,
          4765,  7785,  2836,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101,  1996,  2194, 24273,  6519, 23743,  5603,  2015,  2000,  2022,
          2062,  8793,  1999,  1053,  2509,  1998,  2009,  2003, 14669,  2000,
          1037,  2200,  5410,  4284,  1010,  2007,  6599,  6689,  2090,  1015,
          1012,  1019,  3867,  1998,  1017,  1012,  1019,  3867,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [  101, 15745,  3286, 13252,  2003,  2036,  2019, 14316,  1999, 17410,
         11382,  8791,  1010,  2028,  1997,  2634,  1521,  1055,  7915,  1011,
          3652,  3541, 10346, 15007,