# Models

In [1]:
!pip install transformers[sentencepiece]

Collecting transformers[sentencepiece]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers[sentencepiece])
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[sentencepiece])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers[sentencepiece])
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
#@ CREATING A TRANSFORMER
from transformers import BertConfig, BertModel

config = BertConfig()                    # Building a config
model = BertModel(config)                # Building the model from config (weights are randomly initialized)

In [3]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
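
The printed config determines the size of the model built in cell 2. As a rough sketch, the parameter count can be worked out by hand from these values, assuming the standard BERT layout: token, position, and segment embeddings; per layer a self-attention block and a feed-forward block, each followed by LayerNorm; and a final pooler. The helper names `linear_params` and `count_bert_parameters` are made up for this illustration and are not part of `transformers`.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_bert_parameters(vocab_size=30522, hidden_size=768, intermediate_size=3072,
                          num_hidden_layers=12, max_position_embeddings=512,
                          type_vocab_size=2):
    """Hypothetical helper: count the parameters of a plain BERT encoder."""
    layer_norm = 2 * hidden_size  # scale and bias

    # Token, position, and segment embedding tables, followed by LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm

    # Query, key, value, and output projections, followed by LayerNorm
    attention = 4 * linear_params(hidden_size, hidden_size) + layer_norm

    # Two linear layers (up and down projection), followed by LayerNorm
    feed_forward = (linear_params(hidden_size, intermediate_size)
                    + linear_params(intermediate_size, hidden_size)
                    + layer_norm)

    pooler = linear_params(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


total = count_bert_parameters()
print(f"{total:,} parameters")
```

With the default values printed above this comes to 109,482,240 parameters, which should agree with `sum(p.numel() for p in model.parameters())` on the model from cell 2 if the layout assumption holds. `num_attention_heads` does not appear because the heads only split `hidden_size` among themselves.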



In [4]:
#@ LOADING A PRETRAINED TRANSFORMERS MODEL
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")

In [5]:
#@ SAVING THE PRETRAINED MODEL
model.save_pretrained("/content/drive/MyDrive/Hugging Face")

# Tokenizers

- Tokenizers are one of the core components of the NLP pipeline

- They translate text into data that can be processed by the model

- Models can only process numbers, so tokenizers need to convert our text inputs into numerical data

**Some tokenization algorithms are:**
1. Word-based
2. Character-based
3. Subword-based

## Word-based

- It is generally very easy to set up and use with only a few rules, and it often yields decent results

**Problems:**

- Very large vocabularies
- Large quantity of out-of-vocabulary tokens
- Loss of meaning across very similar words

In [6]:
#@ SPLITTING THE WORDS INTO TOKENS - WORD BASED
tokenized_text = "I am Saugat Regmi".split()
print(tokenized_text)

['I', 'am', 'Saugat', 'Regmi']
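
The out-of-vocabulary problem listed above can be made concrete with a small sketch. It builds a word-level vocabulary from a toy corpus and maps any unseen word to an unknown token. The corpus, the `encode_words` helper, and the choice of ID 0 for `[UNK]` are all assumptions made for this illustration.

```python
# Toy corpus; in practice the vocabulary is built from a much larger one
corpus = ["I am Saugat Regmi", "I am a student"]

# Reserve ID 0 for unknown words, then assign IDs in order of first appearance
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))


def encode_words(text):
    """Map each whitespace-separated word to its ID, or to the [UNK] ID."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]


print(vocab)
print(encode_words("I am a students"))
```

Here `students` is not in the vocabulary, so it collapses to the `[UNK]` ID even though `student` is known. The word-level vocabulary has no way to relate the two forms.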


## Character-based

Character-based tokenizers split the text into characters, rather than words. This has two primary benefits:

- The vocabulary is much smaller
- There are much fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters

**Problems:**
- Sequences are translated into a very large number of tokens to be processed by the model
- Less meaningful individual tokens

This affects the size of the context the model can carry around, and it reduces the amount of text we can use as input to the model.
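
The trade-off can be seen on the same sentence used earlier. This is a minimal sketch in plain Python; the variable names are illustrative only.

```python
text = "I am Saugat Regmi"

# Character-based: every character, including spaces, becomes a token
char_tokens = list(text)
# Word-based, for comparison
word_tokens = text.split()

# The character vocabulary only needs one entry per distinct character
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters in the vocabulary")
```

The vocabulary is tiny, but the sequence is several times longer than the word-level one, which is what eats into the model's maximum input length.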



## Subword-based tokenization

Subword-based tokenization lies between the character-based and word-based algorithms.

It relies on two principles:
- Frequently used words should not be split into smaller subwords
- Rare words should be decomposed into meaningful subwords
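
One way to apply these two principles is greedy longest-match-first splitting, the idea behind WordPiece-style tokenizers such as BERT's. The following is a simplified sketch, not the library's implementation. The `wordpiece_split` helper is hypothetical, and `toy_vocab` is hand-picked so that it reproduces the way `bert-base-cased` splits `Saugat` and `Regmi` in this notebook.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): frequent words are whole entries, rare names are not
toy_vocab = {"I", "am", "Sa", "##uga", "##t", "Reg", "##mi"}

tokens = [piece
          for word in "I am Saugat Regmi".split()
          for piece in wordpiece_split(word, toy_vocab)]
print(tokens)
```

The frequent words `I` and `am` stay whole, while the rare names are decomposed into pieces that exist in the vocabulary.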


In [7]:
#@ LOADING THE TOKENIZER
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [8]:
print(tokenizer)

BertTokenizer(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


In [9]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [10]:
#@ USING THE TOKENIZER
tokenizer("Hello, I am Saugat Regmi")

{'input_ids': [101, 8667, 117, 146, 1821, 17784, 15650, 1204, 23287, 3080, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
#@ SAVING THE TOKENIZER
tokenizer.save_pretrained("/content/drive/MyDrive/Hugging Face")

('/content/drive/MyDrive/Hugging Face/tokenizer_config.json',
 '/content/drive/MyDrive/Hugging Face/special_tokens_map.json',
 '/content/drive/MyDrive/Hugging Face/vocab.txt',
 '/content/drive/MyDrive/Hugging Face/added_tokens.json',
 '/content/drive/MyDrive/Hugging Face/tokenizer.json')

# Encoding

Translating text into numbers is known as *encoding*. It is done in two steps: tokenization, followed by conversion to input IDs.

- The first step is to split the text into words (or parts of words, punctuation symbols, etc.), called tokens
- The second step is to convert the tokens into numbers, so we can build a tensor out of them and feed them to the model

In [13]:
#@ APPLYING TOKENIZATION
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Hello, I am Saugat Regmi"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['Hello', ',', 'I', 'am', 'Sa', '##uga', '##t', 'Reg', '##mi']


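- The tokenizer above is a subword tokenizer: rare words such as `Saugat` are split into pieces, and `##` marks a piece that continues the previous one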

In [14]:
#@ CONVERTING INTO THE NUMERICAL FORM
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[8667, 117, 146, 1821, 17784, 15650, 1204, 23287, 3080]
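
Conceptually, the conversion step is a dictionary lookup in the tokenizer's vocabulary. The sketch below uses a small slice of that vocabulary, copied from the tokens and IDs printed in this notebook. The `tokens_to_ids` helper is hypothetical. It also shows why calling the tokenizer directly (cell 10) returns two extra IDs: 101 and 102 are the `[CLS]` and `[SEP]` special tokens added around the sequence.

```python
# Slice of the bert-base-cased vocabulary, taken from the outputs in this notebook
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "Hello": 8667, ",": 117, "I": 146, "am": 1821,
    "Sa": 17784, "##uga": 15650, "##t": 1204, "Reg": 23287, "##mi": 3080,
}


def tokens_to_ids(tokens, vocab):
    """Look up each token's vocabulary index."""
    return [vocab[token] for token in tokens]


tokens = ["Hello", ",", "I", "am", "Sa", "##uga", "##t", "Reg", "##mi"]

ids = tokens_to_ids(tokens, toy_vocab)
with_special = tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"], toy_vocab)

print(ids)           # same IDs as convert_tokens_to_ids above
print(with_special)  # same input_ids as the direct tokenizer call in cell 10
```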


**Exercise:** repeat the two steps above (tokenization, then conversion to input IDs) on a pair of sentences:

In [17]:
#@ APPLYING TOKENIZATION
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = ["I've been waiting for a HuggingFace course my whole life",
            "I hate this so much"]

tokens = tokenizer.tokenize(sequence)
print(tokens)

#@ CONVERTING INTO THE NUMERICAL FORM
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['I', "'", 've', 'been', 'waiting', 'for', 'a', 'Hu', '##gging', '##F', '##ace', 'course', 'my', 'whole', 'life', 'I', 'hate', 'this', 'so', 'much']
[146, 112, 1396, 1151, 2613, 1111, 170, 20164, 10932, 2271, 7954, 1736, 1139, 2006, 1297, 146, 4819, 1142, 1177, 1277]


## Decoding

- Decoding goes the other way around: from vocabulary indices, we want to get back a string. This can be done with the `decode()` method

In [19]:
#@ DECODING
decoded_string = tokenizer.decode([8667, 117, 146, 1821, 17784, 15650, 1204, 23287, 3080])
print(decoded_string)

Hello, I am Saugat Regmi
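
Note that `decode()` does more than map indices back to tokens: it also merges the `##` pieces back into whole words. A simplified sketch of that idea, using a reverse lookup built from the IDs shown in this notebook, could look like this. The `toy_decode` helper and its comma cleanup rule are assumptions for illustration, not the library's logic.

```python
# Reverse vocabulary slice, taken from the tokens and IDs printed in this notebook
id_to_token = {
    8667: "Hello", 117: ",", 146: "I", 1821: "am",
    17784: "Sa", 15650: "##uga", 1204: "##t", 23287: "Reg", 3080: "##mi",
}


def toy_decode(ids):
    """Map IDs back to tokens, then glue subword pieces into readable text."""
    tokens = [id_to_token[i] for i in ids]
    text = " ".join(tokens).replace(" ##", "")  # merge continuation pieces
    return text.replace(" ,", ",")              # simplified punctuation cleanup


decoded = toy_decode([8667, 117, 146, 1821, 17784, 15650, 1204, 23287, 3080])
print(decoded)
```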
