# BERT

### Introduction to BERT

* Uses only the encoder part of the Transformer

* Brings the fine-tuning approach to NLP
* Trained not only with a Masked Language Model (MLM) objective but also with Next Sentence Prediction (NSP)
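
As a rough illustration of the MLM objective, the sketch below masks tokens the way the BERT paper describes: about 15% of positions are selected, and of those 80% become `[MASK]`, 10% become a random token and 10% stay unchanged. The helper name `mask_tokens`, the toy vocabulary and the fixed seed are assumptions made for this example, and special tokens are not handled.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels is None where nothing is predicted."""
    rng = random.Random(seed)  # fixed seed only to keep the example repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)              # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)           # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)          # 10%: keep the original token
        else:
            labels.append(None)               # position is not scored
            masked.append(token)
    return masked, labels

tokens = "the cat sat on the mat and looked at the dog".split()
vocab = sorted(set(tokens))                   # toy vocabulary (hypothetical)
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

Only the positions where `labels` is not `None` would contribute to the MLM loss.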

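## JointEmbedding 
BERT combines three kinds of embeddings:

* Token Embeddings: each token is converted to its vocabulary index, which is then looked up as a vector

* Segment Embeddings: each token is marked with 0 or 1 to tell the two sentences apart, ex) [0,0,0, ... 1,1,1]

* Position Embeddings: the position (order index) of each token within the whole sequence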

  <img alt='img0' src='./img/img0.png' style="width : 400px">
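
To make the three inputs concrete, here is a minimal sketch of how token, segment and position ids could be built for a sentence pair. The helper name `build_inputs`, the toy `vocab` and the example sentences are hypothetical; the `[CLS]` / `[SEP]` layout follows the input format described in the BERT paper.

```python
def build_inputs(tokens_a, tokens_b, vocab):
    """Build token, segment and position ids for a pair of tokenized sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    token_ids = [vocab[t] for t in tokens]                 # token -> index
    # sentence A (with [CLS] and its [SEP]) is segment 0, sentence B is segment 1
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))                # 0, 1, 2, ...
    return tokens, token_ids, segment_ids, position_ids

# toy vocabulary (hypothetical, not the real WordPiece vocabulary)
vocab = {t: i for i, t in enumerate(["[PAD]", "[CLS]", "[SEP]", "my", "dog", "is", "cute"])}

tokens, token_ids, segment_ids, position_ids = build_inputs(["my", "dog"], ["is", "cute"], vocab)
print(tokens)
print(token_ids)
print(segment_ids)
print(position_ids)
```

Each of the three id lists has the same length, so the three embedding lookups can be summed element-wise.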

## WordPiece Embedding 

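* BERT's WordPiece-based embeddings are context-based, while classic word embeddings such as Word2Vec are word-based. (WordPiece itself is the subword tokenizer; the context dependence comes from the encoder.)
* The word "bank" takes on different meanings depending on the context.
* ex) We went to the river bank || I need to go to the bank
* BERT produces two different vectors for "bank" in these sentences, whereas a word embedding produces only one.

  <a href = 'https://medium.com/swlh/differences-between-word2vec-and-bert-c08a3326b5d1'> Source: differences between WordPiece Embedding and Word2Vec </a>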
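
The difference can be illustrated with a toy example on two sentences like the ones above. A static lookup returns the same vector for "bank" in both sentences, while any step that mixes in the surrounding tokens does not. The 2-d vectors and the averaging function are hypothetical stand-ins; a real BERT encoder uses self-attention, not a plain average.

```python
# Toy static embedding table (hypothetical values, not trained vectors).
static_table = {
    "we": [0.1, 0.0], "went": [0.0, 0.2], "to": [0.1, 0.1], "the": [0.1, 0.1],
    "river": [0.9, 0.0], "bank": [0.5, 0.5],
    "i": [0.0, 0.1], "need": [0.2, 0.0], "go": [0.0, 0.3],
}

def static_vector(tokens, index):
    """Word-level lookup: the vector depends only on the token itself."""
    return static_table[tokens[index]]

def contextual_vector(tokens, index):
    """Toy stand-in for an encoder: mix the token vector with the sentence mean."""
    vectors = [static_table[t] for t in tokens]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return [(own + m) / 2 for own, m in zip(vectors[index], mean)]

s1 = "we went to the river bank".split()
s2 = "i need to go to the bank".split()
i1, i2 = s1.index("bank"), s2.index("bank")

same_static = static_vector(s1, i1) == static_vector(s2, i2)
same_contextual = contextual_vector(s1, i1) == contextual_vector(s2, i2)
print(same_static, same_contextual)
```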


In [1]:
import torch
from torch import nn

class JointEmbedding(nn.Module) : 

    def __init__(self, vocab_size, size, device='cpu') :
        super().__init__()
        self.size = size
        self.device = device

        self.token_emb = nn.Embedding(vocab_size, size)
        # only two segment ids (0 and 1) are ever looked up
        self.segment_emb = nn.Embedding(2, size)

        self.norm =  nn.LayerNorm(size)

    def forward(self,input_tensor) : 
        # positional embedding
        pos_tensor = self.attention_position(self.size, input_tensor)
        # segment embedding
        segment_tensor = torch.zeros_like(input_tensor).to(self.device)

        # the first part of each sentence stays segment 0, the rest becomes segment 1
        sentence_size = input_tensor.size(-1)
        segment_tensor[:, sentence_size // 2 + 1:] = 1

        output = self.token_emb(input_tensor) + self.segment_emb(segment_tensor) + pos_tensor
        return self.norm(output)

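    def attention_position(self,dim,input_tensor) :
        '''
        Sinusoidal positional encoding from the Transformer paper.
        Returns a (batch_size, sentence_size, dim) float tensor with sin on
        even columns and cos on odd columns.
        '''
        # number of rows in input_tensor (batch size)
        batch_size = input_tensor.size(0)

        # sentence length (number of tokens)
        sentence_size = input_tensor.size(-1)

        # token positions 0..sentence_size-1 (long = int64)
        pos = torch.arange(sentence_size, dtype=torch.long).to(self.device)

        # d = index over the embedding dimensions; columns 2i and 2i+1 share the exponent 2i/dim
        d = torch.arange(dim, dtype=torch.long).to(self.device)
        d = 2 * torch.div(d, 2, rounding_mode='floor') / dim

        # unsqueeze(1) turns shape (sentence_size,) into (sentence_size, 1),
        # so dividing by the (dim,) tensor broadcasts to (sentence_size, dim)
        pos = pos.unsqueeze(1)
        pos = pos / (1e4**d)

        pos[:, ::2] = torch.sin(pos[:, ::2])
        pos[:, 1::2] = torch.cos(pos[:, 1::2])

        # *pos.size() unpacks the shape, i.e. expand(batch_size, sentence_size, dim)
        return pos.expand(batch_size, *pos.size())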

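    def numeric_position(self,dim,input_tensor) : 
        # plain integer positions 0..dim-1, broadcast to the shape of input_tensor
        # (dim must equal the sentence length for expand_as to work)
        pos_tensor = torch.arange(dim,dtype=torch.long).to(self.device)
        return pos_tensor.expand_as(input_tensor)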
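
For readers who want to see the numbers behind the positional embedding, here is a pure-Python sketch of the sinusoidal scheme from the Transformer paper, which `attention_position` is based on. The function name `sinusoidal_position` and the small sizes are chosen only for this illustration.

```python
import math

def sinusoidal_position(sentence_size, dim):
    """Return a sentence_size x dim table of sinusoidal position values."""
    table = []
    for pos in range(sentence_size):
        row = []
        for j in range(dim):
            # columns 2i and 2i+1 share the same frequency 1 / 10000^(2i/dim)
            angle = pos / (10000 ** (2 * (j // 2) / dim))
            # even columns use sin, odd columns use cos
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_position(sentence_size=4, dim=6)
for row in table:
    print([round(value, 3) for value in row])
```

Position 0 always yields sin(0) = 0 on even columns and cos(0) = 1 on odd columns; later positions differ column by column, which is what lets the model tell positions apart.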


    



### BERT paper base parameters
1. Encoder layers = 12
2. Attention heads = 12
3. Hidden size (= embedding size) = 768
4. WordPiece vocabulary = 30522 (i.e. 30,522 tokens)
5. Parameters = 110M


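### Computing the 110M
* Embeddings: 30522*768 = 23,440,896 for the token table, about 24M once the position, segment and LayerNorm parameters are included
* 12 encoder layers = about 85M 
* Pooler dense weight matrix and bias: [768, 768] = 589824, [768] = 768 (589824 + 768 = 590592)

Total = about 110M

    <a href='https://stackoverflow.com/questions/64485777/how-is-the-number-of-parameters-be-calculated-in-bert-model'>Detailed link</a>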
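
The arithmetic can be checked with a few lines of Python. Two values are not stated above and are assumed from the BERT-base configuration: a maximum of 512 positions and a feed-forward size of 4 * 768 = 3072.

```python
vocab_size, hidden, layers = 30522, 768, 12
max_position = 512           # assumption: BERT-base maximum sequence length
segments = 2                 # segment ids 0 and 1
intermediate = 4 * hidden    # assumption: BERT-base feed-forward size (3072)

def dense(n_in, n_out):
    """Parameters of a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden      # scale and shift vectors

# token + position + segment tables, followed by one LayerNorm
embeddings = (vocab_size + max_position + segments) * hidden + layer_norm

# query, key, value and output projections, followed by one LayerNorm
attention = 4 * dense(hidden, hidden) + layer_norm
# two feed-forward linear layers, followed by one LayerNorm
feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)

pooler = dense(hidden, hidden)

total = embeddings + encoder + pooler
print(embeddings, encoder, pooler, total)
```

The total comes out slightly under 110M, which is why the figure is usually quoted as a rounded value.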


In [None]:
from torch import nn

class Bert(nn.Module) : 
    def __init__(self,vocab_size,dim_input,dim_output, attention_heads = 12) -> None:
        '''
        vocab_size : total size of the input vocabulary
        dim_input : (= hidden size = embedding size) 768 
        dim_output : (= hidden size = embedding size) 768, currently unused
        '''
        super().__init__()
        self.embedding = JointEmbedding(vocab_size,dim_input)
        # batch_first=True so tensors are (batch, sentence, dim), matching JointEmbedding
        self.transformerEncoderLayer = nn.TransformerEncoderLayer(d_model=dim_input,nhead=attention_heads,activation='gelu',batch_first=True)
        # BERT-base stacks 12 encoder layers
        self.transformerEncoder = nn.TransformerEncoder(self.transformerEncoderLayer,12)
        self.token_prediction_layer = nn.Linear(dim_input,vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)
        self.classification_layer = nn.Linear(dim_input,2)

    def forward(self, input_tensor, attention_mask) : 
        embedded = self.embedding(input_tensor)
        # encode the embeddings, not the raw token ids
        encoded = self.transformerEncoder(embedded,attention_mask)

        token_predictions = self.token_prediction_layer(encoded)

        # take the first token of each sequence
        first_word = encoded[:, 0, :]

        return self.softmax(token_predictions), self.classification_layer(first_word)

    

In [3]:
1000000*110

110000000