## Loading the sample data

In [4]:
data = ["LTIMindtree Q2FY24: Show of strength. Good revenue growth and resilient margin performance",
        "The company expects furloughs to be more pronounced in Q3 and it is guiding to a very weak quarter, with revenue decline between 1.5 percent and 3.5 percent",
        "Arkam Ventures is also an investor in Jai Kisan, one of India’s fastest-growing rural fintech platforms for farmers and retailers, and Jumbotail, India’s leading B2B food and grocery marketplace and retail platform",
       ]

## Tokenizers

### Loading Tokenizers

A tokenizer can be loaded with the `from_pretrained()` method of any tokenizer class.

#### AutoTokenizer class 
`AutoTokenizer` selects the tokenizer class that matches the checkpoint name or path passed to `from_pretrained()`.

In [1]:
from transformers import AutoTokenizer

In [2]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [10]:
tokenizer(data[0])

{'input_ids': [101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

#### ModelNameTokenizer class
This loads the model-specific tokenizer from the specified checkpoint.

For example, `DistilBertTokenizer`:

In [3]:
from transformers import DistilBertTokenizer

In [8]:
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [9]:
distilbert_tokenizer(data[0])

{'input_ids': [101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Saving the Tokenizer

In [11]:
distilbert_tokenizer.save_pretrained('./../../../hf_models/distilbert-base-uncased/')

('./../../../hf_models/distilbert-base-uncased/tokenizer_config.json',
 './../../../hf_models/distilbert-base-uncased/special_tokens_map.json',
 './../../../hf_models/distilbert-base-uncased/vocab.txt',
 './../../../hf_models/distilbert-base-uncased/added_tokens.json')

## Tokenization Process

### Tokenization

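`tokenize()` splits the input text into tokens drawn from the checkpoint's vocabulary (`vocab`). Pieces prefixed with `##` continue the preceding token rather than starting a new word.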

In [14]:
tokens = distilbert_tokenizer.tokenize(data[0])
print(tokens)

['lt', '##imi', '##ndt', '##ree', 'q', '##2', '##fy', '##24', ':', 'show', 'of', 'strength', '.', 'good', 'revenue', 'growth', 'and', 'res', '##ili', '##ent', 'margin', 'performance']
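The splits above (`lt`, `##imi`, `##ndt`, `##ree`) are what a greedy longest-match-first WordPiece lookup produces. The sketch below is a simplified, pure-Python illustration of that idea, not the library's implementation: the function name `wordpiece` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also lowercases the text and splits on whitespace and punctuation before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single, already lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary containing only the pieces needed for this demo.
toy_vocab = {"lt", "##imi", "##ndt", "##ree", "res", "##ili", "##ent", "margin"}

print(wordpiece("ltimindtree", toy_vocab))  # ['lt', '##imi', '##ndt', '##ree']
print(wordpiece("resilient", toy_vocab))    # ['res', '##ili', '##ent']
print(wordpiece("margin", toy_vocab))       # ['margin']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```

Rare words such as `LTIMindtree` are absent from the vocabulary as whole words, so they break into several pieces, while common words such as `margin` stay intact.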


### Encoding
`convert_tokens_to_ids()` maps each token to its input ID in the vocabulary. No special tokens are added at this step.

In [17]:
ids = distilbert_tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836]
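Comparing this list with the `input_ids` returned earlier by `distilbert_tokenizer(data[0])` shows that the two differ only by the first and last IDs, 101 and 102. The check below uses only values copied from the outputs in this notebook; the names `CLS_ID` and `SEP_ID` are illustrative labels, chosen on the assumption that these two IDs are the `[CLS]` and `[SEP]` tokens seen when the full encoding is decoded.

```python
# IDs copied from the convert_tokens_to_ids() output above.
ids = [8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024, 2265, 1997,
       3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622, 4765, 7785, 2836]

# input_ids copied from the distilbert_tokenizer(data[0]) output earlier.
input_ids = [101, 8318, 27605, 26379, 9910, 1053, 2475, 12031, 18827, 1024,
             2265, 1997, 3997, 1012, 2204, 6599, 3930, 1998, 24501, 18622,
             4765, 7785, 2836, 102]

CLS_ID, SEP_ID = 101, 102  # illustrative names for the two extra IDs

# Wrapping the plain IDs with the two special IDs reproduces the full encoding.
with_special = [CLS_ID] + ids + [SEP_ID]
print(with_special == input_ids)  # True

# With a single unpadded input, every position is attended to.
attention_mask = [1] * len(with_special)
print(len(attention_mask))  # 24
```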


### Decoding
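`decode()` maps input IDs back to tokens and merges subword pieces back into words. The result does not reproduce the input exactly: the text is lowercased because the checkpoint is uncased, and spacing around some punctuation can change. Decoding the full `input_ids` also shows the special tokens `[CLS]` and `[SEP]`.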

In [19]:
print(distilbert_tokenizer.decode(ids))

ltimindtree q2fy24 : show of strength. good revenue growth and resilient margin performance


In [20]:
print(distilbert_tokenizer.decode(distilbert_tokenizer(data[0])['input_ids']))

[CLS] ltimindtree q2fy24 : show of strength. good revenue growth and resilient margin performance [SEP]
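The merging step in decoding can be approximated by hand. The sketch below is a simplification under stated assumptions, not the library's code: it joins the tokens with spaces, removes the `##` continuation markers, and applies a single cleanup rule for the space before a period. The library applies further cleanup rules that are omitted here; the colon does not appear to be covered by them, which is consistent with `q2fy24 : show` keeping its spaces in the output above.

```python
# Tokens copied from the tokenize() output earlier in this notebook.
tokens = ['lt', '##imi', '##ndt', '##ree', 'q', '##2', '##fy', '##24', ':',
          'show', 'of', 'strength', '.', 'good', 'revenue', 'growth', 'and',
          'res', '##ili', '##ent', 'margin', 'performance']

# Join with spaces, then glue continuation pieces onto the preceding token.
merged = " ".join(tokens).replace(" ##", "")
print(merged)
# ltimindtree q2fy24 : show of strength . good revenue growth and resilient margin performance

# Simplified cleanup: only the "space before period" rule is shown here.
cleaned = merged.replace(" .", ".")
print(cleaned)
# ltimindtree q2fy24 : show of strength. good revenue growth and resilient margin performance
```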
