### Comparing embedding parameter counts: ALBERT vs. BERT

In [1]:
from transformers import BertModel, AlbertModel

In [2]:
bert = BertModel.from_pretrained("bert-base-uncased")
albert = AlbertModel.from_pretrained("albert-base-v2")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertModel: ['predictions.bias', 'predictions.LayerNorm.bias', 'pre

In [3]:
def num_model_param(m):
    return sum(mi.numel() for mi in m.parameters())

In [4]:
bert_embedding = num_model_param(bert.embeddings)
print('number of BERT Embedding Parameters: {}'.format(bert_embedding))

number of BERT Embedding Parameters: 23837184


In [5]:
albert_embedding = num_model_param(albert.embeddings) + num_model_param(albert.encoder.embedding_hidden_mapping_in)
print('number of ALBERT Embedding Parameters: {}'.format(albert_embedding))

number of ALBERT Embedding Parameters: 4005120


In [6]:
100 * (albert_embedding / bert_embedding)

16.801984663960308
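
The roughly 17% figure follows from ALBERT's factorized embedding: tokens are embedded in a small space of width E and then projected up to the hidden size H. The sketch below reproduces both counts by hand. The sizes (vocabulary 30522 with E = H = 768 for `bert-base-uncased`; vocabulary 30000 with E = 128 and H = 768 for `albert-base-v2`; 512 positions and 2 token types for both) are assumptions taken from the published base configurations, not values read in this notebook, and `embedding_params` is an illustrative helper name.

```python
def embedding_params(vocab_size, embed_dim, hidden_dim, max_positions=512, type_vocab=2):
    """Count embedding parameters by hand (all sizes are assumed, not read from a model)."""
    # Word, position, and token-type lookup tables, each of width embed_dim.
    tables = (vocab_size + max_positions + type_vocab) * embed_dim
    # LayerNorm on the embedding output: one weight and one bias per dimension.
    layer_norm = 2 * embed_dim
    # Linear projection embed_dim -> hidden_dim (weight + bias), present only
    # when the embedding is factorized (embed_dim != hidden_dim).
    projection = (embed_dim * hidden_dim + hidden_dim) if embed_dim != hidden_dim else 0
    return tables + layer_norm + projection

bert_count = embedding_params(30522, 768, 768)
albert_count = embedding_params(30000, 128, 768)
print(bert_count, albert_count, 100 * albert_count / bert_count)
```

Under these assumptions both totals agree with the counts printed by the cells above, and almost all of the saving comes from shrinking the word-embedding table from width 768 to width 128.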

### Comparing encoder parameter counts: ALBERT vs. BERT

In [7]:
bert_encoder = num_model_param(bert.encoder)
print('number of BERT Encoder Parameters: {}'.format(bert_encoder))

number of BERT Encoder Parameters: 85054464


In [8]:
albert_encoder = num_model_param(albert.encoder)
print('number of ALBERT Encoder Parameters: {}'.format(albert_encoder))

number of ALBERT Encoder Parameters: 7186944


In [9]:
100 * (albert_encoder / bert_encoder)

8.44981399212627
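
The roughly 8% figure reflects cross-layer parameter sharing: BERT stores 12 separate Transformer layers, while ALBERT reuses the weights of a single layer. The by-hand count below is a sketch under assumed sizes (H = 768, feed-forward width 3072, 12 layers, E = 128), taken from the published base configurations rather than from this notebook; the helper names are illustrative. Note that `albert.encoder` contains `embedding_hidden_mapping_in`, so that projection is included both in the embedding total computed earlier and in this encoder total.

```python
def linear_params(n_in, n_out):
    # Weight matrix plus bias vector of a dense layer.
    return n_in * n_out + n_out

def layer_params(hidden=768, ffn=3072):
    """Parameters in one Transformer encoder layer (assumed sizes)."""
    attention = 4 * linear_params(hidden, hidden)   # query, key, value, output
    feed_forward = linear_params(hidden, ffn) + linear_params(ffn, hidden)
    layer_norms = 2 * (2 * hidden)                  # two LayerNorms, weight + bias each
    return attention + feed_forward + layer_norms

per_layer = layer_params()
bert_encoder_count = 12 * per_layer                          # 12 independent layers
albert_encoder_count = per_layer + linear_params(128, 768)   # 1 shared layer + E -> H projection
print(per_layer, bert_encoder_count, albert_encoder_count)
```

Under these assumptions the totals agree with the encoder counts printed above.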

### Comparing parameter counts: DistilBERT vs. BERT

In [10]:
from transformers import BertTokenizer, BertModel
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')

In [11]:
from transformers import BertModel
bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
import torch
import numpy as np
input = torch.from_numpy(np.random.randint(0, len(tokenizer_bert.vocab), (1, 512)))

In [None]:
!pip install thop --upgrade 

In [25]:
from thop import profile
macs, params = profile(bert, inputs=(input,))



[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.dropout.Dropout'>.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.

In [14]:
macs, params

(43506794496.0, 85646592.0)

In [15]:
from transformers import DistilBertTokenizer, DistilBertModel

In [16]:
tokenizer_distilbert = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [17]:
distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertModel were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['distilbert.transformer.layer.4.ffn.activation.total_ops', 'distilbert.transformer.layer.1.ffn.activation.total_ops', 'distilbert.trans

In [22]:
from thop import profile
distil_macs, distil_params = profile(distilbert, inputs=(input,))



[INFO] Register count_normalization() for <class 'torch.nn.modules.normalization.LayerNorm'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.dropout.Dropout'>.
[INFO] Register count_linear() for <class 'torch.nn.modules.linear.Linear'>.

In [23]:
distil_macs, distil_params

(21753495552.0, 42528768.0)

In [26]:
macs, params

(43506794496.0, 85646592.0)

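BERT has about twice as many parameters as DistilBERT (roughly 85.6M vs. 42.5M as counted by thop) and needs about twice as many MACs for a 512-token input (roughly 43.5G vs. 21.8G).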
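
As a quick check of that "about 2x" reading, the ratios can be computed directly from the two thop outputs. The values below are copied from the cell outputs above. The last three lines are an inference from the numbers, not something thop reports: BERT's thop total equals the encoder count measured earlier plus a 768-to-768 pooler layer and one embedding LayerNorm, which would mean the embedding lookup tables are not included in thop's parameter count.

```python
# Values copied from the thop outputs in this notebook.
bert_macs, bert_params = 43506794496.0, 85646592.0
distilbert_macs, distilbert_params = 21753495552.0, 42528768.0

macs_ratio = bert_macs / distilbert_macs
params_ratio = bert_params / distilbert_params
print(f"MACs ratio: {macs_ratio:.3f}, parameter ratio: {params_ratio:.3f}")

# Inference (assumption): thop total = encoder + pooler + embedding LayerNorm.
encoder_count = 85054464             # printed earlier by num_model_param(bert.encoder)
pooler = 768 * 768 + 768             # assumed dense pooler, weight + bias
embedding_layer_norm = 2 * 768       # assumed LayerNorm weight + bias
print(encoder_count + pooler + embedding_layer_norm == int(bert_params))
```

If that inference holds, the full BERT model is larger than the thop figure by the size of its embedding tables, and the same applies to DistilBERT.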

### Comparing inference speed: DistilBERT vs. BERT

In [27]:
input_ids = np.random.randint(0, len(tokenizer_bert), (1, 512))
attention_mask = np.ones_like(input_ids)
input_ids = torch.from_numpy(input_ids)
attention_mask = torch.from_numpy(attention_mask)

In [28]:
inputs = {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
}

In [29]:
import time
from tqdm import tqdm

def get_latency(model, inputs, n_runs=100):
    # Average wall-clock time (in seconds) of one forward pass over n_runs calls.
    start = time.time()
    for _ in tqdm(range(n_runs)):
        model(**inputs)
    end = time.time()
    return (end - start) / n_runs

In [35]:
latency_bert = get_latency(bert, inputs)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:50<00:00,  1.99it/s]


In [36]:
print(f'BERT latency={latency_bert:.4f}')

BERT latency=0.5031


In [37]:
input_ids = np.random.randint(0, len(tokenizer_distilbert), (1, 512))
attention_mask = np.ones_like(input_ids)
input_ids = torch.from_numpy(input_ids)
attention_mask = torch.from_numpy(attention_mask)

inputs = {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
}

latency_distilbert = get_latency(distilbert, inputs)
print(f'DistilBERT latency={latency_distilbert:.4f}')

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [00:24<00:00,  4.06it/s]

DistilBERT latency=0.2464

DistilBERT is about twice as fast as BERT in this measurement (roughly 0.25 s vs. 0.50 s per forward pass on a 1 x 512 input).
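
The timing loop used above includes the first, often slower, calls and runs with gradient tracking enabled. A slightly more careful variant is sketched below; it is not the notebook's method. `measure_latency` is a hypothetical helper that takes any zero-argument callable, discards a few warm-up calls, and times the rest with `time.perf_counter`. The workload in the example is a stand-in so the sketch runs without downloading a model.

```python
import time

def measure_latency(fn, n_warmup=3, n_runs=20):
    """Return the mean seconds per call of fn(), after n_warmup untimed calls."""
    for _ in range(n_warmup):
        fn()                           # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs

# Stand-in workload (assumption): replace with a model forward pass.
latency = measure_latency(lambda: sum(range(10_000)))
print(f"latency={latency:.6f}")
```

With the models from this notebook, the call would be something like `measure_latency(lambda: bert(**inputs))` placed inside a `with torch.no_grad():` block, so that no autograd graph is built during timing.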