<code>AutoModel</code>
* Instantiates any model from a checkpoint
* A wrapper over the available model classes: it selects the architecture that matches the given checkpoint

In [1]:
from transformers import BertConfig, BertModel
# Build the default BERT configuration
config = BertConfig()
# Build the model from the config (weights are randomly initialized)
model = BertModel(config)



In [2]:
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

## Loading a trained model

In [4]:
model = BertModel.from_pretrained("bert-base-cased")



* The weights are cached under <code>~/.cache/huggingface/hub</code> by default in recent releases (older releases used <code>~/.cache/huggingface/transformers</code>)

* Future calls to <code>from_pretrained()</code> will not download them again

* The cache folder location can be customized by setting the <code>HF_HOME</code> environment variable.
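
As a rough illustration of the lookup described above, the sketch below resolves the cache folder from `HF_HOME`. The helper `resolve_hf_cache_dir` is hypothetical and not part of the library. It assumes a recent release that stores files in a `hub` subfolder, and it ignores the other environment variables the library also consults.

```python
import os
from pathlib import Path


def resolve_hf_cache_dir(env=None):
    """Return the folder where downloaded checkpoints are expected to be cached.

    Hypothetical helper for illustration only; it simplifies the real lookup
    by considering HF_HOME and nothing else.
    """
    env = os.environ if env is None else env
    # Fall back to the default location when HF_HOME is unset or empty.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    # Assumption: downloaded files live in a "hub" subfolder of HF_HOME.
    return Path(hf_home).expanduser() / "hub"


print(resolve_hf_cache_dir())                        # default location
print(resolve_hf_cache_dir({"HF_HOME": "/data/hf"}))  # customized location
```

To take effect for a real download, `HF_HOME` should be set before `transformers` is imported, for example in the shell that launches the notebook.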

## Saving methods
* <code>config.json</code>: architecture info / attributes needed to build the model, plus metadata
* <code>pytorch_model.bin</code>: state dictionary that contains model parameters

In [5]:
model.save_pretrained("save_directory")

In [6]:
!ls save_directory

config.json  pytorch_model.bin


## Tokenization

In [7]:
# split by whitespace

text = "John was a chef"
tokenized_text = text.split()
print(tokenized_text)

['John', 'was', 'a', 'chef']


* A vocabulary contains the tokens found in a corpus. Each token is assigned an ID, from 0 to the vocabulary size minus 1
* With a word-based tokenizer
    * "dog" is represented differently from "dogs"
    * Initially, the model does not know that "dog" and "dogs" are similar
    * Words not in the vocabulary are represented with the unknown token [UNK]
    * It is bad to have many words tokenized into [UNK], because their meaning is lost
* One solution is character-based tokenization
  * Much smaller vocabulary
  * Fewer out-of-vocabulary (unknown) tokens because every word can be built from characters

* Character-based tokenization is not perfect either
  * A large number of tokens must be processed, whereas a word would be a single token with a word-based tokenizer
  * Some argue that characters are less meaningful
    * This depends on the language: a Chinese character carries more information than a character in a Latin-script language
* Subword tokenization combines both approaches to get the best of both worlds
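
The trade-offs listed above can be seen on a toy example. The sketch below builds a word-based and a character-based vocabulary from a made-up two-sentence corpus (the corpus, the `encode` helper, and the ID assignment are illustrative assumptions, not how a real tokenizer is trained) and encodes a text that contains an unseen word.

```python
UNK = "[UNK]"

# Tiny made-up corpus used to build both vocabularies.
corpus = ["John was a chef", "the dog was asleep"]

# Word-based vocabulary: one ID per distinct word, plus the unknown token.
word_vocab = {UNK: 0}
for sentence in corpus:
    for word in sentence.split():
        word_vocab.setdefault(word, len(word_vocab))

# Character-based vocabulary: one ID per distinct character.
char_vocab = {UNK: 0}
for sentence in corpus:
    for char in sentence:
        char_vocab.setdefault(char, len(char_vocab))


def encode(units, vocab):
    """Map each unit to its ID, falling back to the [UNK] ID."""
    return [vocab.get(unit, vocab[UNK]) for unit in units]


text = "the dogs"
word_ids = encode(text.split(), word_vocab)
char_ids = encode(list(text), char_vocab)

print(word_ids)  # "dogs" was never seen as a word, so it maps to the [UNK] ID 0
print(char_ids)  # every character was seen, so no [UNK] appears
print(len(word_ids), len(char_ids))  # far more tokens at the character level
```

The word-based encoding loses "dogs" entirely even though "dog" is in the vocabulary, while the character-based encoding keeps all the information at the cost of a sequence four times longer.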


### Subword tokenization
* Frequently used words should not be split into subwords
* Rare words should be decomposed into meaningful subwords
* For example, "tokenization" is split into "token" and "ization". Both carry semantic meaning.
    * Useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords
* Common algorithms: byte-level BPE (GPT-2), WordPiece (BERT), and SentencePiece or Unigram (multilingual models)
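
To make the idea concrete, here is a minimal sketch of how a WordPiece-style tokenizer can split a word once a vocabulary exists: it repeatedly takes the longest prefix found in the vocabulary and marks word-internal pieces with `##`. The function name and the toy vocabulary are assumptions for illustration. Learning the vocabulary from a corpus is not shown.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption, far smaller than a real one).
toy_vocab = {"token", "##ization", "Trans", "##former", "simple"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("Transformer", toy_vocab))   # ['Trans', '##former']
print(wordpiece_tokenize("simple", toy_vocab))        # ['simple']
print(wordpiece_tokenize("chef", toy_vocab))          # ['[UNK]']
```

A frequent word such as "simple" stays whole because it is itself in the vocabulary, while rarer words are decomposed into pieces that are.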

In [8]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")




In [10]:
tokenizer("using a Transformer network is simple")

{'input_ids': [101, 1606, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
# save tokenizer

tokenizer.save_pretrained("save_tokenizer")

('save_tokenizer/tokenizer_config.json',
 'save_tokenizer/special_tokens_map.json',
 'save_tokenizer/vocab.txt',
 'save_tokenizer/added_tokens.json',
 'save_tokenizer/tokenizer.json')

## Encoding
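* Two steps: (a) tokenization, (b) conversion to input IDs
* Tokenization: split text into words or subwords (tokens)
    * Multiple rules can govern the split, so the tokenizer must match the one used to pretrain the model
* Conversion to numbers
    * The tokenizer has a vocabulary, which is downloaded when it is instantiated with <code>from_pretrained()</code>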
      

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


In [13]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[1606, 170, 13809, 23763, 2443, 1110, 3014]
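
These IDs differ from the `input_ids` returned by calling the tokenizer directly, which start with 101 and end with 102. Those are the IDs of the special tokens `[CLS]` and `[SEP]` in the BERT vocabulary. The sketch below reproduces the full output for a single unpadded sequence. The helper `add_special_tokens` is hypothetical and only illustrates what the tokenizer call adds.

```python
def add_special_tokens(ids, cls_id=101, sep_id=102):
    """Wrap one sequence of IDs the way a BERT-style tokenizer call does.

    Hypothetical helper; 101 and 102 are the [CLS] and [SEP] IDs seen in
    the tokenizer output earlier in this notebook.
    """
    input_ids = [cls_id] + list(ids) + [sep_id]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sequence: all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding: attend to every token
    }


ids = [1606, 170, 13809, 23763, 2443, 1110, 3014]
encoded = add_special_tokens(ids)
print(encoded["input_ids"])
```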


## Decoding
* Converts indices back to tokens
* Groups together tokens that are part of the same word to produce a readable sentence

In [14]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

using a Transformer network is simple
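
The grouping step can be sketched in a few lines, assuming the WordPiece convention that a `##` prefix marks a continuation of the previous token. The function `merge_wordpieces` is a hypothetical illustration. The real `decode()` also handles special tokens and spacing clean-up.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a string, gluing '##' pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without the '##' marker
        else:
            words.append(token)     # start of a new word
    return " ".join(words)


tokens = ['using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
print(merge_wordpieces(tokens))  # using a Transformer network is simple
```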
