<a href="https://colab.research.google.com/github/ruthgn/HF/blob/main/02_Creating_a_Transformer_model_and_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to create and use a model with the Hugging Face Transformers library and takes a closer look at tokenizers, one of the core components of the NLP pipeline.

In [None]:
!pip install datasets transformers[sentencepiece]

The `AutoModel` class and all of its relatives in the Hugging Face Transformers library are simple wrappers over the wide variety of models available in the library. They are clever wrappers: they automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with that architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let's take a look at how this works with a BERT model.

# Creating a Transformer

The first thing we'll need to do to initialize a BERT model is load a configuration object:

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

The configuration contains many attributes that are used to build the model:

In [None]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.16.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



For instance, the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has.
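To make the link between these attributes and the resulting model more concrete, here is a rough back-of-the-envelope sketch that estimates the number of trainable parameters of a BERT-style encoder from the configuration values printed above. The helper `estimate_bert_parameters` is a hypothetical name (it is not part of the library), and the breakdown assumes the standard BERT layout: word, position, and token-type embeddings, encoder layers made of self-attention plus a feed-forward block, and a final pooler layer.

```python
def estimate_bert_parameters(vocab_size, hidden_size, num_hidden_layers,
                             intermediate_size, max_position_embeddings,
                             type_vocab_size):
    """Rough parameter count for a BERT-style encoder (illustrative only)."""
    layer_norm = 2 * hidden_size  # one scale and one bias per hidden unit

    # Three embedding tables followed by a LayerNorm.
    embeddings = (
        vocab_size * hidden_size
        + max_position_embeddings * hidden_size
        + type_vocab_size * hidden_size
        + layer_norm
    )

    # A dense layer mapping hidden_size -> hidden_size (weights + bias).
    dense_hh = hidden_size * hidden_size + hidden_size

    # Self-attention: query, key, value, and output projections + LayerNorm.
    attention = 4 * dense_hh + layer_norm

    # Feed-forward block: hidden -> intermediate -> hidden + LayerNorm.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )

    pooler = dense_hh
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


# Values copied from the default BertConfig printed above.
total = estimate_bert_parameters(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
)
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the default configuration works out to roughly 109 million parameters, most of them in the 12 encoder layers. Note that `num_attention_heads` does not appear in the estimate: it changes how the hidden vector is split across heads, not the size of the projection matrices. The estimate can be compared against `sum(p.numel() for p in model.parameters())` on the model built earlier.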

## Different loading methods

IMPORTANT: Creating a model from the default configuration initializes it with random values. The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but this would require a long time and a lot of data. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained.

Loading an already trained Transformer model is simple. We can do this using the `from_pretrained()` method:

In [None]:
model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We could replace `BertModel` with the equivalent `AutoModel` class. We'll do this from now on because it produces checkpoint-agnostic code: code that works for one checkpoint should work with another, even if the architecture is different, as long as the checkpoint was trained for a similar task (e.g., a sentiment analysis task).
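As a purely conceptual sketch of what "checkpoint-agnostic" means, the toy code below picks a model class based on the `model_type` field of a configuration, similar in spirit to the `"model_type": "bert"` entry printed earlier. All names here (`ToyBertModel`, `ToyGPT2Model`, `TOY_MODEL_REGISTRY`, `toy_auto_model`) are hypothetical stand-ins; this is not the library's actual implementation.

```python
import json


class ToyBertModel:
    """Stand-in for a BERT architecture class (illustrative only)."""

    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    """Stand-in for a GPT-2 architecture class (illustrative only)."""

    def __init__(self, config):
        self.config = config


# Hypothetical registry mapping a model type to the class that builds it.
TOY_MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}


def toy_auto_model(config_json):
    """Choose the architecture class from the config instead of hard-coding it."""
    config = json.loads(config_json)
    model_type = config["model_type"]
    if model_type not in TOY_MODEL_REGISTRY:
        raise ValueError(f"Unknown model type: {model_type}")
    return TOY_MODEL_REGISTRY[model_type](config)


toy_model = toy_auto_model('{"model_type": "bert", "hidden_size": 768}')
print(type(toy_model).__name__)  # ToyBertModel
```

The calling code never names a specific architecture, which is why swapping one checkpoint for another does not require changing it.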

In the code sample above, we did not use `BertConfig`; instead, we loaded a pretrained model via the `bert-base-cased` identifier (a model checkpoint that was trained by the authors of BERT themselves). This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, **and it can also be fine-tuned on a new task**. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

_Note: The weights have been downloaded and cached (so future calls to the from_pretrained() method won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the HF_HOME environment variable._

# Tokenizers

Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the _architecture_ of the model) as well as its vocabulary (a bit like the _weights_ of the model).

Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer:

In [None]:
tokenizer("OK now, check this out")

{'input_ids': [101, 10899, 1208, 117, 4031, 1142, 1149, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

Let's see how the `input_ids` are generated. To do this, we'll need to look at the intermediate methods of the tokenizer.

## Encoding

Translating text to numbers is known as _encoding_. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs.

As we've seen, the _first step_ is to split the text into words (or parts of words, punctuation symbols, etc.), usually called _tokens_. There are multiple rules that can govern that process, which is exactly why we need to instantiate the tokenizer using the name of the model, **to make sure we use the same rules that were used when the model was pretrained.**

The _second step_ is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a _vocabulary_, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, **we need to use the same vocabulary used when the model was pretrained**.

To get a better understanding of the two steps, we'll explore them separately.

_Note: We will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs._

### Tokenization

The tokenization process is done by the `tokenize()` method of the tokenizer:

In [None]:
sequence = "OK now, check this out. Transformers are really cool. Using a Transformer network is simple."
tokens = tokenizer.tokenize(sequence)

print(tokens)

['OK', 'now', ',', 'check', 'this', 'out', '.', 'Transformers', 'are', 'really', 'cool', '.', 'Using', 'a', 'Trans', '##former', 'network', 'is', 'simple', '.']
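In the output above, `Transformer` was split into `'Trans'` and `'##former'`. The `##` prefix marks a piece that continues the current word rather than starting a new one. BERT's tokenizer uses the WordPiece subword algorithm, and the following is a minimal sketch of its greedy longest-match-first idea. The function `toy_wordpiece` and the five-entry vocabulary are hypothetical; the real tokenizer also normalizes text, splits punctuation, and works with a vocabulary of tens of thousands of entries.

```python
def toy_wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Tiny illustrative vocabulary (not the real bert-base-cased vocabulary).
toy_vocab = {"Using", "a", "Trans", "##former", "network"}

toy_tokens = []
for word in "Using a Transformer network".split():
    toy_tokens.extend(toy_wordpiece(word, toy_vocab))

print(toy_tokens)  # ['Using', 'a', 'Trans', '##former', 'network']
```

Because rare words are broken into smaller known pieces, the vocabulary can stay relatively small while still covering most words.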


### From tokens to input IDs

The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method:

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[10899, 1208, 117, 4031, 1142, 1149, 119, 25267, 1132, 1541, 4348, 119, 7993, 170, 13809, 23763, 2443, 1110, 3014, 119]


These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model.
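Note that this list does not start with 101 or end with 102, unlike the `input_ids` returned earlier when the tokenizer was called directly. For BERT checkpoints, those two IDs correspond to the special tokens `[CLS]` and `[SEP]`, which are added when the tokenizer is called directly but not by `convert_tokens_to_ids()`. The sketch below mimics that behavior with a plain dictionary; `toy_encode` is a hypothetical helper, and the small vocabulary only contains the IDs shown in this notebook.

```python
# Token-to-ID entries copied from the outputs shown in this notebook.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "OK": 10899, "now": 1208, ",": 117,
    "check": 4031, "this": 1142, "out": 1149,
}


def toy_encode(tokens, vocab):
    """Look up each token and wrap the sequence in [CLS] ... [SEP]."""
    input_ids = [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single-sentence input
        "attention_mask": [1] * len(input_ids),  # no padding to mask out
    }


encoding = toy_encode(["OK", "now", ",", "check", "this", "out"], toy_vocab)
print(encoding["input_ids"])  # [101, 10899, 1208, 117, 4031, 1142, 1149, 102]
```

This reproduces the dictionary returned for `"OK now, check this out"` at the start of the Tokenizers section, which is one reason to call the tokenizer directly in practice rather than chaining the intermediate methods by hand.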

## Decoding

Decoding goes the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows:

In [None]:
decoded_string = tokenizer.decode([10899, 1208, 117, 4031, 1142, 1149, 119, 25267, 1132, 1541, 4348, 119, 7993, 170, 13809, 23763, 2443, 1110, 3014, 119])

print(decoded_string)

OK now, check this out. Transformers are really cool. Using a Transformer network is simple.
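Notice that decoding does more than map each ID back to its token: `'Trans'` and `'##former'` were merged back into `Transformer`. A minimal sketch of that idea follows, assuming a hypothetical `toy_decode` helper and a tiny ID-to-token table built from the outputs above. The real `decode()` method additionally handles details such as spacing around punctuation and special tokens.

```python
# ID-to-token entries taken from the tokenization and ID outputs above.
toy_id_to_token = {
    7993: "Using",
    170: "a",
    13809: "Trans",
    23763: "##former",
    2443: "network",
}


def toy_decode(ids, id_to_token):
    """Map IDs back to tokens, then glue '##' pieces onto the previous token."""
    tokens = [id_to_token[i] for i in ids]
    return " ".join(tokens).replace(" ##", "")


decoded = toy_decode([7993, 170, 13809, 23763, 2443], toy_id_to_token)
print(decoded)  # Using a Transformer network
```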
