<a href="https://colab.research.google.com/github/stemgene/All-you-need-is-attention/blob/main/03_bert_model_architecture_params.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.bilibili.com/video/BV1xT411V7Ng/?spm_id_from=333.788&vd_source=81884c519d60bbdad4b6fd87d340415f

In [2]:
from transformers import BertModel, BertForSequenceClassification

![image](https://heidloff.net/assets/img/2023/02/transformers.png)

In [3]:
model_name = 'bert-base-uncased'
model = BertModel.from_pretrained(model_name)
# cls_model = BertForSequenceClassification.from_pretrained(model_name)


Structure of the `bert-base-uncased` model:
* embeddings: BertEmbeddings
    1. word_embeddings
    2. position_embeddings
    3. token_type_embeddings
    4. LayerNorm
    5. dropout
* encoder: BertEncoder
    1. layer 0
    2. ...
    12. layer 11
* pooler: BertPooler
    1. dense: Linear(in_features=768, out_features=768)
    2. activation: Tanh()

This is the bare base model, with no task-specific head attached.

`BertForSequenceClassification` adds a final classifier layer on top of the structure above, for example
`(classifier): Linear(in_features=768, out_features=2)`
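
As a rough illustration of what that extra layer does, the sketch below applies a linear classifier to a pooled vector of size 768. The tensor `pooled_output` is random stand-in data rather than real BERT output, and the variable names are illustrative, not taken from the library.

```python
import torch
from torch import nn

hidden_size = 768   # matches in_features of the classifier shown above
num_labels = 2      # matches out_features of the classifier shown above

# Stand-in for the pooler output of a batch of 4 sequences (random data, not real BERT output).
pooled_output = torch.randn(4, hidden_size)

# The classification head: a single linear layer mapping 768 features to 2 logits.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled_output)

# Weight matrix (2 x 768) plus bias vector (2).
head_params = sum(p.numel() for p in classifier.parameters())
print(logits.shape, head_params)
```

The head is tiny compared with the base model: one weight matrix and one bias vector.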


### Model summary

* BERT is only the encoder part of the Transformer. The original Transformer is an encoder-decoder (seq2seq) model.
* As `BertForSequenceClassification` shows, attaching a classifier on top of the BERT model is enough to do classification. By the same logic, a custom head can be attached to BERT and fine-tuned.
* BERT has three parts:
    * embedding (word, position, segment type (token_type)),
    * encoder (12 layers) (BERT base has 12 layers, BERT Large has 24)
        * self-attention (Q, K, V)
        * feed forward
    * pooler

![img](https://www.researchgate.net/publication/359301499/figure/fig1/AS:1134827726213121@1647575422725/The-overall-structure-of-the-BERT-model.png)
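
To make the embedding part more concrete, here is a minimal sketch of how the word, position and token-type lookups can be combined: the three vectors are added element-wise and then normalized. The sizes are deliberately tiny and the ids are made up for illustration; dropout is omitted, and this is a simplified stand-in, not the library's `BertEmbeddings` code.

```python
import torch
from torch import nn

# Small illustrative sizes (bert-base-uncased uses a 30522-token vocabulary and hidden size 768).
vocab_size, max_positions, type_vocab_size, hidden_size = 100, 16, 2, 8

word_embeddings = nn.Embedding(vocab_size, hidden_size)
position_embeddings = nn.Embedding(max_positions, hidden_size)
token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size)
layer_norm = nn.LayerNorm(hidden_size)

input_ids = torch.tensor([[5, 17, 42, 9]])        # one sequence of 4 made-up token ids
token_type_ids = torch.tensor([[0, 0, 1, 1]])     # first two tokens in segment A, last two in segment B
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# All three lookups return shape (batch, seq_len, hidden), so they can be summed directly.
embeddings = (
    word_embeddings(input_ids)
    + position_embeddings(position_ids)
    + token_type_embeddings(token_type_ids)
)
embeddings = layer_norm(embeddings)
print(embeddings.shape)
```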



# Parameter count statistics

In [4]:
next(model.named_parameters())[0]

'embeddings.word_embeddings.weight'

In [5]:
next(model.named_parameters())[1].shape

torch.Size([30522, 768])

For the word embedding layer, the vocabulary size is 30522 and each token is mapped to a 768-dimensional vector.

In [6]:
total_params = 0
total_learnable_params = 0
for name, param in model.named_parameters():
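    if param.requires_grad:  # requires_grad=True means the parameter is learnable
        total_learnable_params += param.numel()  # numel() is the number of elements, i.e. the product of the dimension sizes, e.g. 30522 * 768 for the word embedding
    total_params += param.numel()  # all parameters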

print(f"total parameters is {total_params / 1e8}, learnable parameters is {total_learnable_params / 1e8}")

total parameters is 1.0948224, learnable parameters is 1.0948224


All parameters in this model are trainable, roughly 110 million in total (109,482,240).
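
The same total can be cross-checked by hand from the layer shapes. The sketch below assumes the standard `bert-base-uncased` configuration values (512 positions, 2 token types, feed-forward size 3072), which are not printed in this notebook, and the helper `linear` is a made-up convenience function.

```python
# Assumed values from the standard bert-base-uncased configuration.
vocab_size = 30522
hidden = 768
max_positions = 512
type_vocab = 2
num_layers = 12
intermediate = 3072


def linear(n_in, n_out):
    """Parameters of a Linear layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale vector and one shift vector

# Three embedding tables plus the LayerNorm that follows them.
embedding_params = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)       # query, key, value and attention output projections
    + layer_norm                     # LayerNorm after attention
    + linear(hidden, intermediate)   # feed-forward expansion
    + linear(intermediate, hidden)   # feed-forward projection back to hidden size
    + layer_norm                     # LayerNorm after feed-forward
)
encoder_params = num_layers * per_layer
pooler_params = linear(hidden, hidden)

total = embedding_params + encoder_params + pooler_params
print(embedding_params, encoder_params, pooler_params, total)
```

Under these assumptions the total comes out to 109,482,240, which agrees with the 1.0948224 (in units of 1e8) printed above.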

## Share of parameters in the embedding, encoder and pooler modules

In [7]:
total_params = 0
total_embedding_params = 0
total_encoder_params = 0
total_pooler_params = 0
for name, param in model.named_parameters():
    if 'embedding' in name:
        total_embedding_params += param.numel()  # numel() is the number of elements, i.e. the product of the dimension sizes
    if 'encoder' in name:
        total_encoder_params += param.numel()
    if 'pooler' in name:
        total_pooler_params += param.numel()
    total_params += param.numel()  # all parameters

In [8]:
print(f"embedding parameters is {total_embedding_params / 1e8}, encoder parameters is {total_encoder_params / 1e8}, pooler is {total_pooler_params / 1e8}")

embedding parameters is 0.23837184, encoder parameters is 0.85054464, pooler is 0.00590592


In [9]:
print(f"Ratio: embedding parameters is {total_embedding_params / total_params:.4f}, encoder parameters is {total_encoder_params / total_params:.4f}, pooler is {total_pooler_params / total_params:.4f}")

Ratio: embedding parameters is 0.2177, encoder parameters is 0.7769, pooler is 0.0054
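
One possible way to make this kind of breakdown less error-prone is to group parameters by the first component of their name instead of keeping one counter per module. The helper `count_params_by_module` and the `TinyModel` class below are hypothetical and exist only for illustration; the tiny model merely mimics the three top-level names.

```python
from collections import defaultdict

from torch import nn


def count_params_by_module(model: nn.Module) -> dict:
    """Sum parameter counts per top-level submodule (the first part of each parameter name)."""
    counts = defaultdict(int)
    for name, param in model.named_parameters():
        top_level = name.split(".")[0]
        counts[top_level] += param.numel()
    return dict(counts)


class TinyModel(nn.Module):
    """Toy stand-in with the same three top-level names as BertModel."""

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(10, 4)  # 10 * 4 = 40 parameters
        self.encoder = nn.Linear(4, 4)         # 4 * 4 + 4 = 20 parameters
        self.pooler = nn.Linear(4, 4)          # 4 * 4 + 4 = 20 parameters


counts = count_params_by_module(TinyModel())
print(counts)
```

Calling `count_params_by_module(model)` on the `BertModel` loaded earlier should give one entry each for `embeddings`, `encoder` and `pooler`, since its parameter names start with those prefixes.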
