# Tokenizers (PyTorch)

In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")


In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [5]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
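
The first and last entries of `input_ids` above (101 and 102) are not among the IDs that `convert_tokens_to_ids` returns for the same sentence further down. In `bert-base-cased` they correspond to the special `[CLS]` and `[SEP]` tokens. The following is a minimal sketch in plain Python of how the three fields relate for a single, unpadded sequence. The helper name `build_encoding` is hypothetical and not part of `transformers`, and the two special-token IDs are hard-coded assumptions.

```python
CLS_ID = 101  # assumed ID of [CLS] in bert-base-cased
SEP_ID = 102  # assumed ID of [SEP] in bert-base-cased


def build_encoding(token_ids):
    """Lay out plain token IDs like the tokenizer output shown above.

    Hypothetical helper for illustration only; handles one unpadded sequence.
    """
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        # A single sentence belongs entirely to segment 0.
        "token_type_ids": [0] * len(input_ids),
        # No padding was added, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
    }


encoding = build_encoding([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(encoding)
```

With padding or sentence pairs the masks would no longer be constant, which this sketch does not cover.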

In [6]:
tokenizer.save_pretrained("/home/mshahidul/webiner/tokenizer")

('/home/mshahidul/webiner/tokenizer/tokenizer_config.json',
 '/home/mshahidul/webiner/tokenizer/special_tokens_map.json',
 '/home/mshahidul/webiner/tokenizer/vocab.txt',
 '/home/mshahidul/webiner/tokenizer/added_tokens.json',
 '/home/mshahidul/webiner/tokenizer/tokenizer.json')

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
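
The `Trans` / `##former` split above comes from WordPiece, which breaks a word into the longest vocabulary entries it can find, marking continuation pieces with `##`. The sketch below illustrates that greedy longest-match idea in plain Python. It assumes a tiny hand-written vocabulary (`toy_vocab`) instead of the real `bert-base-cased` vocabulary, and the function `wordpiece_split` is a hypothetical name, not the library's implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"Using", "a", "Trans", "##former", "network", "is", "simple"}
toy_sequence = "Using a Transformer network is simple"
toy_tokens = [
    piece
    for word in toy_sequence.split()
    for piece in wordpiece_split(word, toy_vocab)
]
print(toy_tokens)
```

The real tokenizer also normalizes text and splits on punctuation before this step, so the sketch only covers the subword part.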


In [9]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]


In [10]:
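# Note: these IDs are not the `ids` computed above (11303, 1200 replace
# 13809, 23763), so the decoded text reads "transformer", not "Transformer".
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)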

Using a transformer network is simple


## Train a tokenizer

In [None]:
paragraph='''
From that survey, astronomers hope to learn about the birth of our Milky Way galaxy, the mysterious matter comprising much of the cosmos, and how the universe evolved into its current arrangement. Perhaps they will even uncover clues about its fate. They will also use the telescope to home in on millions of transient objects, “faint things that go bang, explode or move in the night,” said Tony Tyson, an astrophysicist at the University of California, Davis. That includes gorging black holes and collisions of dense, dead stars. Smile, universe! It is time for your close-up with the Vera C. Rubin Observatory.

The telescope, more than two decades in the making, will provide a comprehensive view of the night sky unlike anything astronomers have seen before. The project’s scientists revealed some of the first imagery it released on Monday.

“Rubin Observatory is the greatest astronomical discovery machine ever built,” Željko Ivezić, the director of construction, said during the presentation revealing the first images. He noted that for the first time, the number of observed celestial objects will be greater than the number of people living on Earth.

Over the next decade, the imagery will be patched together to create “the greatest movie of all time,” Dr. Ivezić said.

The observatory, named after the astronomer Vera Rubin, is a joint venture of the U.S. Department of Energy and the National Science Foundation. It was built on a mountain in northern Chile in the foothills of the Andes at the edge of the Atacama Desert. The location, high and dry, provides clear skies for observing the cosmos.

At the news conference on Monday, Dr. Ivezić explained that part of Rubin’s powerful capability was that its singular data set would serve many different science goals.
'''

In [None]:
# Naive sentence split on "."; this also cuts abbreviations such as "Dr." and "U.S."
sentences = paragraph.split('.')

In [None]:
# Wrap the sentences in a DatasetDict with a single "train" split

from datasets import Dataset, DatasetDict

data = {"sentence": sentences}
dataset = Dataset.from_dict(data)

raw_datasets = DatasetDict({"train": dataset})

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence'],
        num_rows: 20
    })
})

In [None]:
# Group the sentences into batches of two for tokenizer training
training_corpus = [
    raw_datasets["train"][i : i + 2]["sentence"]
    for i in range(0, len(raw_datasets["train"]), 2)
]
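
The list above holds every batch in memory at once, which is fine for 20 sentences. For a larger corpus, a generator that yields one batch at a time is a common alternative. This is a minimal sketch on a plain Python list; the function name `batch_iterator` and the sample data are assumptions for illustration.

```python
def batch_iterator(sentences, batch_size=2):
    """Yield consecutive slices of `sentences`, `batch_size` items at a time."""
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


# Hypothetical sample data standing in for the real sentences.
toy_sentences = ["a", "b", "c", "d", "e"]
toy_batches = list(batch_iterator(toy_sentences))
print(toy_batches)
```

A generator can only be consumed once, so it would need to be recreated before each new training run.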

In [None]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [None]:
# 52000 is the target vocabulary size; a corpus this small yields far fewer tokens
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, vocab_size=52000)
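
The GPT-2 tokenizer is based on byte-pair encoding (BPE), so retraining it means relearning a list of merges from the new corpus. The sketch below shows a single merge step on a toy word-frequency table: count adjacent symbol pairs, pick the most frequent, and fuse it everywhere. It is a simplified illustration under assumed data (`toy_words`), working on characters instead of bytes. The actual training is done by the `tokenizers` library and differs in detail.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word split into characters, mapped to its frequency.
toy_words = {tuple("the"): 5, tuple("then"): 2, tuple("that"): 3}
best_pair = most_frequent_pair(toy_words)
toy_words_merged = merge_pair(toy_words, best_pair)
print(best_pair, toy_words_merged)
```

Repeating these two steps until the vocabulary reaches the requested size, or no pairs remain, gives the general shape of BPE training.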

In [None]:
tokens = tokenizer.tokenize(raw_datasets['train']['sentence'][0])
tokens

['Ċ',
 'From',
 'Ġthat',
 'Ġsurvey',
 ',',
 'Ġastronomers',
 'Ġhope',
 'Ġto',
 'Ġlearn',
 'Ġabout',
 'Ġthe',
 'Ġbirth',
 'Ġof',
 'Ġour',
 'ĠMilky',
 'ĠWay',
 'Ġgalaxy',
 ',',
 'Ġthe',
 'Ġmysterious',
 'Ġmatter',
 'Ġcomprising',
 'Ġmuch',
 'Ġof',
 'Ġthe',
 'Ġcosmos',
 ',',
 'Ġand',
 'Ġhow',
 'Ġthe',
 'Ġuniverse',
 'Ġevolved',
 'Ġinto',
 'Ġits',
 'Ġcurrent',
 'Ġarrangement']
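
The `Ġ` and `Ċ` characters in the output above are not corruption. GPT-2 style tokenizers work on bytes and display each byte as a printable character, so a space appears as `Ġ` and a newline as `Ċ`. The sketch below rebuilds that byte-to-character table in plain Python, following the scheme used by GPT-2's published encoder, and maps a few tokens back to readable text. The helper `readable` is a hypothetical name added for illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def readable(token):
    """Convert a byte-level token back to ordinary text (hypothetical helper)."""
    raw = bytes(char_to_byte[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print([readable(t) for t in ["Ċ", "From", "Ġthat", "Ġsurvey"]])
```

The leading `Ċ` in the token list comes from the newline right after the opening `'''` of `paragraph`.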