
In [36]:
import pandas as pd
import numpy as np

import tensorflow as tf

from transformers import pipeline, AutoConfig, AutoTokenizer
from transformers.models.gpt2 import TFGPT2Model, TFGPT2LMHeadModel

from tensorflow.keras.applications import resnet50

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, GlobalAvgPool2D, Dropout

# Advanced Neural Network Architectures

## Hugging Face. Transformers. NLP

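**BERT base model** - a model pretrained on English text with a masked language modeling (MLM) objective: some tokens of the input are hidden and the model learns to predict them from the surrounding context.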
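To make the MLM objective more concrete, here is a minimal sketch of how a training example could be built. The helper `mask_tokens` and its parameters are hypothetical names used only for illustration; it is a simplification that always substitutes `[MASK]`, whereas the original BERT recipe selects about 15% of the tokens and sometimes keeps or randomly replaces them instead.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and remember the originals as labels.

    Hypothetical helper: a simplified version of MLM masking.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels[position] = token  # the model must predict this token
        else:
            masked.append(token)
    return masked, labels

tokens = "this is a cat with black fur .".split()
# a higher mask_prob than 0.15 is used here so that a short sentence gets masked
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)

# putting the labels back recovers the original sentence
restored = [labels.get(i, tok) for i, tok in enumerate(masked)]
print(masked)
print(labels)
```

The training loss is computed only at the masked positions, which is why the labels are stored per position.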

In [2]:
# the model predicts the missing word [MASK]:
predictor = pipeline("fill-mask", model = "bert-base-uncased")


In [3]:
predictor("This is a cat with [MASK] fur.")

[{'score': 0.09475713223218918,
  'token': 2304,
  'token_str': 'black',
  'sequence': 'this is a cat with black fur.'},
 {'score': 0.09283755719661713,
  'token': 2317,
  'token_str': 'white',
  'sequence': 'this is a cat with white fur.'},
 {'score': 0.09080683439970016,
  'token': 2829,
  'token_str': 'brown',
  'sequence': 'this is a cat with brown fur.'},
 {'score': 0.05554262548685074,
  'token': 2417,
  'token_str': 'red',
  'sequence': 'this is a cat with red fur.'},
 {'score': 0.053200945258140564,
  'token': 3756,
  'token_str': 'yellow',
  'sequence': 'this is a cat with yellow fur.'}]

In [4]:
predictor("The most beautiful girl is [MASK].")

[{'score': 0.04016110301017761,
  'token': 8764,
  'token_str': 'mia',
  'sequence': 'the most beautiful girl is mia.'},
 {'score': 0.01859879679977894,
  'token': 2033,
  'token_str': 'me',
  'sequence': 'the most beautiful girl is me.'},
 {'score': 0.013345465064048767,
  'token': 4532,
  'token_str': 'sarah',
  'sequence': 'the most beautiful girl is sarah.'},
 {'score': 0.013320278376340866,
  'token': 4698,
  'token_str': 'anna',
  'sequence': 'the most beautiful girl is anna.'},
 {'score': 0.011734695173799992,
  'token': 5586,
  'token_str': 'rachel',
  'sequence': 'the most beautiful girl is rachel.'}]

In [5]:
# the model predicts sentiment
# score - the model's confidence in the predicted label (softmax probability of that class)

sentiment_predictor = pipeline("sentiment-analysis", model = "finiteautomata/bertweet-base-sentiment-analysis")


In [6]:
sentiment_predictor("You are cool!")

[{'label': 'POS', 'score': 0.9835598468780518}]

In [7]:
sentiment_predictor("You are ok, I guess!")

[{'label': 'POS', 'score': 0.6155866384506226}]

In [8]:
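# Bulgarian for "You are awesome!"; the model was trained on English tweets, so the prediction is unreliable
sentiment_predictor("Ти си страхотен!")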

[{'label': 'NEU', 'score': 0.9044066071510315}]
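The `score` in these outputs is a class probability. A minimal NumPy sketch of how a classification head's raw scores (logits) become a label and a score, assuming three labels `NEG`, `NEU`, `POS` and made-up logits (not the values the real model produces):

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["NEG", "NEU", "POS"]
logits = np.array([-1.0, 0.5, 3.0])  # hypothetical logits, for illustration only

probs = softmax(logits)
best = int(np.argmax(probs))

# same structure as one entry of the pipeline output
prediction = {"label": labels[best], "score": float(probs[best])}
print(prediction)
```

A score close to 1 means that the other classes received almost no probability mass. It does not guarantee that the label is correct.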

**Pretrained models** have two parts:
1. The base model (feature extractor) extracts features from the input data. It is trained on a huge amount of data. Example: in BERT, the base part captures the grammar and meaning of words. In ResNet, the base part extracts visual features (edges, textures, object parts) from images.
2. The head consists of additional task-specific layers that "hang" on the base model. These layers take the output of the base model and transform it into the final result for a specific task.

How to adapt a pretrained model:
- fine-tuning – I train the entire model (base + head) on my data, starting from the already learned weights of the base model (if I have enough data).
- feature extraction – I freeze the base model (its weights aren't updated) and train only the head (if I have a small amount of data).
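The two adaptation strategies can be sketched in Keras as follows. The tiny `base` network is only a stand-in for a real pretrained model, and the layer sizes are arbitrary assumptions; the point is how `trainable` changes which weights the optimizer would update.

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

# stand-in for a pretrained base model (hypothetical sizes)
base = Sequential(
    [Input((8,)), Dense(16, activation="relu"), Dense(16, activation="relu")],
    name="base",
)
# task-specific head for a 3-class problem
head = Dense(3, activation="softmax", name="head")

# feature extraction: the base is frozen, only the head is trained
base.trainable = False
model = Sequential([Input((8,)), base, head])
n_trainable_frozen = len(model.trainable_weights)  # kernel + bias of the head

# fine-tuning: the base is unfrozen, everything is trained
base.trainable = True
n_trainable_full = len(model.trainable_weights)  # 2 base layers + head

print(n_trainable_frozen, n_trainable_full)

# the model must be (re)compiled after changing `trainable` for it to take effect;
# a small learning rate is a common choice when fine-tuning
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="sparse_categorical_crossentropy",
)
```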

In [9]:
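# builds the GPT-2 architecture from its config only: the weights are randomly initialized.
# TFGPT2Model.from_pretrained("gpt2") would load the pretrained weights instead.
gpt2 = TFGPT2Model(config = AutoConfig.from_pretrained("gpt2"))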

In [10]:
# the GPT-2 base model is a feature extractor; I can use it like a regular TensorFlow (Keras) model

In [11]:
type(gpt2)

In [12]:
# architecture settings:
gpt2.config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.47.1",
  "use_cache": true,
  "vocab_size": 50257
}

In [13]:
# GPT-2 uses a byte-level BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [14]:
tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

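n_ctx = 1024 is the maximum context length in tokens. Shorter inputs can be padded up to a common length, and longer inputs have to be truncated to 1024 tokens. GPT-2 has no padding token by default, so I reuse `<|endoftext|>` as one: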

In [15]:
tokenizer.add_special_tokens({"pad_token": "<|endoftext|>"})

1

In [16]:

# GPT-2 has no [MASK] token, so "[MASK]" is split into ordinary subword tokens:
tokenizer("This is a [MASK].")

{'input_ids': [1212, 318, 257, 685, 31180, 42, 4083], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

attention_mask specifies which tokens in the input take part in the attention computation (1) and which are ignored (0 - padding added to equalize lengths).
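Padding, truncation and the attention mask can be sketched in plain Python. The helper `pad_or_truncate` is a hypothetical name. It only mimics, under the assumption of right-side padding, what the tokenizer does when called with `padding="max_length"`, `truncation=True` and a `max_length`.

```python
def pad_or_truncate(input_ids, max_length, pad_token_id):
    """Bring a list of token ids to exactly `max_length` and build the mask."""
    ids = list(input_ids[:max_length])   # truncate if too long
    mask = [1] * len(ids)                # real tokens are attended to
    n_pad = max_length - len(ids)        # how many padding tokens are needed
    return {
        "input_ids": ids + [pad_token_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,  # padding is ignored
    }

ids = [1212, 318, 257, 685, 31180, 42, 4083]  # "This is a [MASK]."
pad_id = 50256                                # id of <|endoftext|> in GPT-2

padded = pad_or_truncate(ids, max_length=10, pad_token_id=pad_id)
truncated = pad_or_truncate(ids, max_length=4, pad_token_id=pad_id)

print(padded)
print(truncated)
```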

In [17]:
tokenizer.decode([1212, 318, 257, 3797, 13])

'This is a cat.'

In [18]:
model_input = tokenizer("This is a [MASK].")

In [19]:
output = gpt2(input_ids = tf.constant(model_input["input_ids"]))

In [20]:
# it is like a dictionary:
output.keys()

odict_keys(['last_hidden_state', 'past_key_values'])

**last_hidden_state** holds the features (latent representations): one vector per input token, produced by the last transformer layer. In a trained model they encode the semantic and contextual meaning of the input tokens and serve as input for downstream tasks.

**past_key_values** contains the attention keys and values for each layer. They are cached so that, when new tokens are generated, the attention scores can be computed without reprocessing the earlier tokens.
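Why caching keys and values helps can be shown with a small NumPy sketch of single-head causal attention. All sizes and weight matrices are made-up assumptions, not GPT-2 internals; the sketch only checks that processing tokens one at a time with a cache gives the same result as processing the whole sequence at once.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
    """Full causal self-attention over a whole sequence (single head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions after the current one
    scores = np.where(future, -np.inf, scores)               # a token cannot see the future
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))  # 5 token representations (hypothetical)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# 1) everything at once
full = causal_attention(x @ w_q, x @ w_k, x @ w_v)

# 2) token by token, reusing cached keys and values
cached_k, cached_v, incremental = [], [], []
for token in x:
    cached_k.append(token @ w_k)  # only the new token's key and value are computed
    cached_v.append(token @ w_v)
    q = token @ w_q
    weights = softmax(q @ np.stack(cached_k).T / np.sqrt(d))
    incremental.append(weights @ np.stack(cached_v))
incremental = np.stack(incremental)

print(np.allclose(full, incremental))
```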

In [21]:
output["last_hidden_state"]

<tf.Tensor: shape=(7, 768), dtype=float32, numpy=
array([[ 1.1214095 , -0.8640525 ,  0.13023531, ...,  0.31707507,
         1.0533041 ,  0.18391262],
       [ 0.5812209 , -1.0613099 ,  0.8348378 , ..., -0.7311867 ,
         0.8308601 ,  0.55502856],
       [ 0.7847384 , -1.3256708 ,  0.6562492 , ..., -0.80023485,
         0.9549153 ,  0.46295825],
       ...,
       [ 1.1406943 , -1.7758677 ,  0.65799713, ...,  0.04174722,
         1.2192308 ,  0.94241923],
       [ 1.025761  , -1.4551395 ,  0.57776505, ...,  0.14108682,
         1.1733803 ,  0.8208883 ],
       [ 1.2687958 , -1.0308347 ,  0.2923348 , ...,  0.105415  ,
         1.5348504 ,  1.1095463 ]], dtype=float32)>

In [22]:
model_input

{'input_ids': [1212, 318, 257, 685, 31180, 42, 4083], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Tensor shape = (7, 768): \
len([1212, 318, 257, 685, 31180, 42, 4083]) = 7 tokens \
For each token there is a 768-dimensional vector (n_embd = 768).

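**TFGPT2LMHeadModel** is the GPT-2 base model extended with a language modeling (LM) head. The LM head adds a layer that turns the output of the base model into scores (logits) over the vocabulary, from which the probabilities of the next token are obtained. Built from a config alone, as in the next cell, the model has randomly initialized weights, so its predictions are not meaningful; `TFGPT2LMHeadModel.from_pretrained("gpt2")` loads the pretrained weights.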

In [23]:
gpt2lm = TFGPT2LMHeadModel(config = AutoConfig.from_pretrained("gpt2"))

In [24]:
# the output of the model before softmax is applied:
result = gpt2lm(input_ids = tf.constant(model_input["input_ids"]))["logits"]

In [25]:
outputs = tf.argmax(result, axis = -1)[0].numpy()

In [26]:
tokenizer.decode(outputs)

' fadesAliceAliceAlice� plethora unsub'
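What the LM head computes can be sketched in NumPy. The sizes are deliberately small and the matrices are random, so this only illustrates the shapes and steps. It assumes, as GPT-2 does, that the token embedding matrix is reused as the output projection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_embd, vocab_size = 7, 16, 50  # toy sizes; GPT-2 uses 768 and 50257

hidden = rng.normal(size=(seq_len, n_embd))  # stands in for last_hidden_state
wte = rng.normal(size=(vocab_size, n_embd))  # token embedding matrix (random here)

# LM head: one score per vocabulary entry for every position
logits = hidden @ wte.T  # shape (seq_len, vocab_size)

# softmax turns the logits of each position into probabilities
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# greedy choice: the output at position i is the prediction for token i + 1
next_ids = logits.argmax(axis=-1)

print(logits.shape, next_ids.shape)
```

Because softmax preserves the ordering of the scores, taking `argmax` of the logits gives the same ids as taking it over the probabilities.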

In [27]:
gpt2lm.generate()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


<tf.Tensor: shape=(1, 20), dtype=int32, numpy=
array([[50256, 49691, 34136, 34136, 15199, 15199, 15185, 15185, 15185,
        15185, 15185, 44379, 44379, 20198, 20198, 20198, 20198, 20198,
        20198, 20198]], dtype=int32)>

## Keras. Foundational models for vision

### Fine-Tuning

In [41]:
tf.keras.backend.clear_session()

ResNet50 is a foundation model trained for image classification. With include_top = False I remove the head, i.e. the final pooling and classification layers:

In [42]:
backbone = resnet50.ResNet50(include_top = False)

In [43]:
backbone.summary()

backbone.trainable = False - the parameters (weights) of the backbone aren't updated during training. This is useful when I want to use the pretrained model purely as a feature extractor and don't want to modify its learned representations:

In [40]:
# freezes all the weights:
# backbone.trainable = False

In [44]:
# freezes only the early layers (indices 1 to 59); the remaining layers stay trainable:
for layer in backbone.layers[1:60]:
    layer.trainable = False

In [45]:
model = Sequential([
    Input((299, 299, 3)),
    backbone,
    GlobalAvgPool2D(),
    Dense(256, activation = "relu"),
    Dropout(0.5),
    Dense(128, activation = "relu"),
    Dense(20, activation = "softmax"),
])
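What the layers after the backbone do with its output can be sketched in NumPy. ResNet50 without its top returns feature maps with 2048 channels; the batch size and spatial size below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the backbone output: (batch, height, width, channels)
feature_maps = rng.normal(size=(2, 4, 4, 2048))

# GlobalAvgPool2D: average every channel over its spatial positions
pooled = feature_maps.mean(axis=(1, 2))  # shape (batch, 2048)

def dense_params(n_in, n_out):
    """Number of parameters of a Dense layer: weights plus biases."""
    return n_in * n_out + n_out

# the three Dense layers of the head: 2048 -> 256 -> 128 -> 20
sizes = [2048, 256, 128, 20]
head_params = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

print(pooled.shape)
print(head_params, sum(head_params))
```

Dropout adds no parameters, so the head is small compared with the backbone; most of its weights sit in the first Dense layer.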

In [46]:
model.summary()

In [48]:
# the kernel shape is (height, width, input channels, filters): 64 kernels of size 7x7x3
backbone.layers[2].kernel.shape

TensorShape([7, 7, 3, 64])
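The meaning of this kernel shape can be checked with a short NumPy sketch. The kernel and image patch are random stand-ins, not the real ResNet50 weights, and the bias count assumes the layer uses one bias per filter.

```python
import numpy as np

kernel_shape = (7, 7, 3, 64)  # (height, width, input channels, filters)
k_h, k_w, c_in, c_out = kernel_shape

n_weights = k_h * k_w * c_in * c_out  # weights of all 64 filters
n_params = n_weights + c_out          # plus one bias per filter (assumed)

rng = np.random.default_rng(0)
kernel = rng.normal(size=kernel_shape)  # random stand-in for the learned kernel
patch = rng.normal(size=(7, 7, 3))      # one 7x7 RGB patch of an image

# each filter produces one number per patch: a sum over height, width and channels
responses = np.tensordot(patch, kernel, axes=3)  # shape (64,)

print(n_weights, n_params, responses.shape)
```

Sliding the patch over the whole image repeats this computation at every position, which yields 64 output channels.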