# Transformers, what can they do?

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [2]:
!pip install datasets evaluate transformers[sentencepiece]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-c

In [15]:
from transformers import BertConfig, TFBertModel

# Building the config
config = BertConfig()

# Building the model from the config
model_default = TFBertModel(config)
print(config)
# A model built from the default config starts with randomly initialised weights.
# It can process inputs, but its outputs are meaningless until it is trained.
# Training from scratch is possible from this starting point, but training a
# transformer is expensive in both time and compute resources.
# It is therefore advisable to load a pretrained model, as below.

model = TFBertModel.from_pretrained("bert-base-cased")


BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
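
To get a feel for why training from scratch is expensive, the sizes in the config printed above can be turned into a rough parameter count. This is only a back-of-the-envelope sketch: it assumes the standard BERT encoder layout (three embedding tables, then per layer four attention projections and a two-layer feed-forward block, each followed by a LayerNorm, plus a final pooler). The helper `dense` is made up for this note and is not part of the library.

```python
# Values copied from the default BertConfig printed above.
hidden = 768          # hidden_size
vocab = 30522         # vocab_size
max_pos = 512         # max_position_embeddings
type_vocab = 2        # type_vocab_size
intermediate = 3072   # intermediate_size
layers = 12           # num_hidden_layers


def dense(n_in, n_out):
    """Weights plus biases of one fully connected layer (hypothetical helper)."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale vector and one shift vector

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by a LayerNorm.
attention = 4 * dense(hidden, hidden) + layer_norm
# Expand to intermediate_size, project back, followed by a LayerNorm.
feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm
# Dense layer applied to the first token's final hidden state.
pooler = dense(hidden, hidden)

total = embeddings + layers * (attention + feed_forward) + pooler
print(f"{total:,} parameters")
```

Under these assumptions the default config comes to roughly 109 million weights, all of which start out random. The `bert-base-cased` checkpoint uses its own vocabulary, so its count will differ somewhat.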


In [10]:
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [12]:
# Save the model to local folder
model.save_pretrained("./models")
# The models folder will contain config.json (the model configuration) and
# tf_model.h5 (the weights, also called the "state dictionary").



#### TOKENIZERS
- There are 3 types: word-based, character-based, and sub-word-based.
- Vocabulary size is the total number of unique tokens.
- English has over 500,000 words, so representing each word with a unique number implies a very large vocabulary.
- Instead, uncommon words are left out of the vocabulary and are represented by an "UNK" (unknown) token.
- The downside is that all unknown words have the same representation once tokenized.
- One way to reduce unknown tokens is character-based tokenization.
- The downside is that there are far more tokens to process, which also means less text fits in the model's context than with a word-based tokenizer of the same context length.
- Secondly, the meaning carried by a single character is less than that carried by a word.

- The third method, sub-word tokenization, captures the best of both worlds.
- A sub-word tokenizer relies on the principle that frequently used words should not be split into sub-words,
- but rare words should be split into meaningful sub-words.
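
The three approaches in the list above can be compared on one sentence. This is a toy sketch, not how the library tokenizers are implemented: `TOY_VOCAB` is a made-up vocabulary and `subword_tokenize` is a hypothetical greedy longest-match helper in the spirit of WordPiece, where `##` marks a piece that continues a word.

```python
text = "Using a Transformer network is simple"

# Word-based: split on whitespace, one token per word.
word_tokens = text.split()

# Character-based: every character (spaces included) is a token.
char_tokens = list(text)

# Sub-word: greedy longest-match-first lookup in a toy vocabulary.
TOY_VOCAB = {"Using", "a", "Trans", "##former", "network", "is", "simple"}


def subword_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, TOY_VOCAB)]

print("word     :", len(word_tokens), word_tokens)
print("character:", len(char_tokens))
print("sub-word :", len(subword_tokens), subword_tokens)
```

The character version produces many more tokens for the same sentence, while the sub-word version keeps common words whole and splits only the rarer "Transformer".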

In [14]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

In [17]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [19]:
tokenizer.save_pretrained("./models")

('./models/tokenizer_config.json',
 './models/special_tokens_map.json',
 './models/vocab.txt',
 './models/added_tokens.json',
 './models/tokenizer.json')

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [29]:
# TOKENIZER PIPELINE. "ENCODING or translating text to numbers"
# STEP 1 - create tokens
tokens=tokenizer.tokenize("Using a Transformer network is simple") 
#gives tokens : clearly this is a sub-word tokenizer
print(f"TOKENS : {tokens}")
# STEP 2 - convert to IDS
ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"IDS : {ids}")
# STEP 3 - add the special start / end tokens to prepare the IDs for the model
inputs = tokenizer.prepare_for_model(ids)
print(f"INPUTS : {inputs}")

TOKENS : ['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']
IDS : [7993, 170, 13809, 23763, 2443, 1110, 3014]
INPUTS : {'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
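
What the last step adds can be reproduced by hand for the single-sequence case. A minimal sketch, assuming 101 and 102 are the start and end IDs (they are the two values that wrap the sequence in the INPUTS printed above); `add_special_tokens` is a hypothetical helper written for this note, not a library function.

```python
# IDs printed by convert_tokens_to_ids in the cell above.
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Assumed special-token IDs, read off the printed INPUTS.
CLS_ID, SEP_ID = 101, 102


def add_special_tokens(token_ids):
    """Mimic the single-sequence case: wrap the IDs and build the extra fields."""
    input_ids = [CLS_ID] + token_ids + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # only one segment
        "attention_mask": [1] * len(input_ids),  # no padding, attend everywhere
    }


manual_inputs = add_special_tokens(ids)
print(manual_inputs)
```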


In [31]:
print(f"DECODED IDS : {tokenizer.decode(inputs.input_ids)}")

DECODED IDS : [CLS] Using a Transformer network is simple [SEP]


In [33]:
# Alternatively, get the same inputs in a single call
inputs = tokenizer("Using a Transformer network is simple")
print(f"INPUTS : {inputs}")

INPUTS : {'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [39]:
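# NOTE: `input_ids` is not defined in any cell shown here. The output below
# implies a sequence-classification model and a single batched sequence of IDs.
model(input_ids).logits.numpy()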

array([[-2.7276206,  2.878937 ]], dtype=float32)

In [40]:
# PUTTING IT TOGETHER
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_299']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
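
The two sentences in the cell above have different lengths, which is why `padding=True` is passed. A rough sketch of what padding does, assuming a pad ID of 0 (the `pad_token_id` in the BERT config printed earlier; assumed to be the same for this checkpoint). `pad_batch` is a hypothetical helper and the token IDs are made up.

```python
PAD_ID = 0  # assumed pad_token_id


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, for illustration only.
batch = [[101, 11, 12, 13, 14, 102], [101, 21, 22, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, every row has the same length, so the batch can be stacked into one rectangular tensor.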


In [41]:
print(output)

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606955,  1.6122805],
       [-3.6183178,  3.9137495]], dtype=float32)>, hidden_states=None, attentions=None)
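
The logits above are raw scores, not probabilities. A softmax over each row turns them into probabilities. The sketch below uses plain Python and the logits printed above; the label order (index 0 = NEGATIVE, index 1 = POSITIVE) is an assumption about this checkpoint and is best confirmed via `model.config.id2label`.

```python
import math

# Logits copied from the printed TFSequenceClassifierOutput above.
logits = [[-1.5606955, 1.6122805],
          [-3.6183178, 3.9137495]]

# Assumed label order; check model.config.id2label to confirm.
LABELS = ["NEGATIVE", "POSITIVE"]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]
for probs in probabilities:
    best = max(range(len(probs)), key=probs.__getitem__)
    print(LABELS[best], [round(p, 4) for p in probs])
```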
