## Pipeline
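- encoder model : excels at "understanding" a sentence; strong at classification and extraction
- self-supervised : a language model needs no labeled data for pretraining; the training targets are derived from the raw text itself
- decoder model : excels at "predicting" the next token; strong at auto-regressive generation
- encoder-decoder model : excels at "generating" text conditioned on an input (overlaps heavily with decoder models)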
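
To make the self-supervised point concrete, here is a minimal pure-Python sketch of how training pairs can be built from raw text alone. The helper names (`causal_lm_pairs`, `masked_lm_pair`) and the whitespace "tokenizer" are illustrative assumptions, not part of `transformers`; real pretraining works on subword ids and masks positions at random.

```python
# Toy illustration: training targets come from the text itself, not from human labels.
MASK = "[MASK]"  # placeholder mask symbol used only in this sketch

def causal_lm_pairs(tokens):
    """Decoder-style objective: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_pair(tokens, position):
    """Encoder-style objective: hide one token and use it as the label."""
    masked = list(tokens)
    label = masked[position]
    masked[position] = MASK
    return masked, label

tokens = "I really want to go home".split()  # whitespace split, for illustration only
pairs = causal_lm_pairs(tokens)
masked, label = masked_lm_pair(tokens, 2)

print(pairs[0])       # context and next-token target
print(masked, label)  # masked input and the hidden token
```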

In [None]:
from transformers import pipeline

# encoder model
# classification
sent = pipeline('sentiment-analysis')
sent('')

classifier = pipeline('zero-shot-classification')
classifier('', candidate_labels=['A', 'B'])

ner = pipeline('ner', grouped_entities=True)
ner('')


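# self-supervised
# language model (encoder model)
mask = pipeline('fill-mask')
# the mask token depends on the checkpoint (e.g. '<mask>' or '[MASK]'), so read it from the tokenizer
mask(mask.tokenizer.mask_token, top_k=3)
# language model (decoder model)
gen = pipeline('text-generation')
gen('', max_length=60, num_return_sequences=2)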


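# encoder-decoder model
# note: the default 'question-answering' pipeline is extractive and runs on an encoder checkpoint;
# generative question answering is the encoder-decoder use case
qa = pipeline('question-answering')
qa(question='', context='')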

summary = pipeline('summarization')
summary('')

trans = pipeline('translation', model='')
trans('')

## Tokenizer

```
- Method 1 : input => ids
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')

- Method 2 : input => tokens (subwords) => ids
tokens = tokenizer.tokenize(raw_inputs[0])
inputs = tokenizer.convert_tokens_to_ids(tokens)
```

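> Method 1 adds the special tokens automatically (`[CLS]` and `[SEP]` for BERT-style tokenizers)  
> Method 1 returns a dict, so it must be passed to the model as `model(**inputs)`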
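
The two paths above can be sketched with a toy tokenizer. This is only an illustration under stated assumptions: the vocabulary `VOCAB` and its ids are made up, and the greedy longest-match-first split is a simplified take on WordPiece-style subword tokenization, not the library's actual implementation.

```python
# Hypothetical toy vocabulary; real checkpoints ship their own vocab file and ids.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "really": 5, "want": 6, "hes": 7, "##itat": 8, "##ing": 9}

def tokenize(text):
    """Greedy longest-match-first subword split; '##' marks a word continuation."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in VOCAB:
                    piece = candidate
                    break
                end -= 1
            if piece is None:        # no subword matches: the whole word is unknown
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

def convert_tokens_to_ids(tokens):
    return [VOCAB[t] for t in tokens]

def encode(text):
    """Method 1 in one call: tokenize, map to ids, then wrap with special tokens."""
    ids = convert_tokens_to_ids(tokenize(text))
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

print(tokenize("hesitating"))                            # Method 2, step 1
print(convert_tokens_to_ids(tokenize("I really want")))  # Method 2, step 2
print(encode("I really want"))                           # Method 1
```

The only difference between the two results is the pair of special-token ids wrapped around the sequence.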

In [11]:
from transformers import AutoTokenizer
from transformers import BertTokenizer

# Pre-trained Tokenizer
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Pre-trained Bert Tokenizer
bert_checkpoint = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(bert_checkpoint)

# Input (Method 1)
# the special tokens ([CLS] = 101, [SEP] = 102) are added automatically
raw_inputs = ['I really want', 'to go home!!!']
inputs = tokenizer(raw_inputs)
print(inputs)

{'input_ids': [[101, 146, 1541, 1328, 102], [101, 1106, 1301, 1313, 106, 106, 106, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

In [14]:
# Input (Method 2)
sequence = raw_inputs[0] # tokenize() takes one sentence at a time
tokens = tokenizer.tokenize(sequence)
inputs = tokenizer.convert_tokens_to_ids(tokens)
print(f'tokens : {tokens}\nids : {inputs}')

tokens : ['I', 'really', 'want']
ids : [146, 1541, 1328]


In [None]:
# id => text
tokenizer.decode([146, 1541, 1328])

In [None]:
tokenizer.save_pretrained('save directory')

## Model

In [None]:
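from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer
from transformers import BertConfig, BertModel
import torch.nn.functional as F

# pre-trained Model
# the tokenizer must come from the same checkpoint as the model, and the model expects tensors
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')
model = AutoModel.from_pretrained(checkpoint)
output = model(**inputs) # Method 1 returns a dict, so unpack it
output.last_hidden_state.shape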

# pre-trained Model 2 
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
output = model(**inputs)
predictions = F.softmax(output.logits, dim=-1) # Sentiment Analysis
model.config.id2label

# prepare a Bert model from a config (weights are randomly initialized)
config = BertConfig()
model = BertModel(config)

# pre-trained Bert Model
model = BertModel.from_pretrained('bert-base-cased')

In [None]:
model.save_pretrained('save directory')

## Sum up

### Method 1

In [17]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = 'I have wait to get off from work'

inputs = tokenizer(sequence, return_tensors='pt')
print(inputs['input_ids'])

model(**inputs)

tensor([[ 101, 1045, 2031, 3524, 2000, 2131, 2125, 2013, 2147,  102]])


SequenceClassifierOutput(loss=None, logits=tensor([[ 2.8568, -2.4364]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
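
The logits above are raw scores, not probabilities. As a rough sketch, a softmax turns them into a distribution that can then be mapped to a label. The `id2label` mapping written here is an assumption for illustration; the authoritative one is `model.config.id2label`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.8568, -2.4364]                 # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed mapping; check model.config.id2label

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], probs)
```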

### Method 2

In [21]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = 'I have wait to get off from work'

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print(input_ids) # no [CLS] / [SEP] special tokens

model(input_ids)

tensor([[1045, 2031, 3524, 2000, 2131, 2125, 2013, 2147]])


SequenceClassifierOutput(loss=None, logits=tensor([[ 2.9519, -2.4365]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
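
The logits differ slightly from Method 1 (2.9519 vs 2.8568) because the Method 2 ids lack the special tokens. A minimal sketch of closing that gap by hand, using the ids printed in the two outputs; `add_special_tokens` is a hypothetical helper written for this note, and 101 / 102 are the ids seen at both ends of the Method 1 output.

```python
CLS_ID, SEP_ID = 101, 102  # ids at the start and end of the Method 1 output

def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a single sequence of ids the way the Method 1 output is wrapped."""
    return [cls_id] + list(ids) + [sep_id]

method2_ids = [1045, 2031, 3524, 2000, 2131, 2125, 2013, 2147]
method1_ids = [101, 1045, 2031, 3524, 2000, 2131, 2125, 2013, 2147, 102]

wrapped = add_special_tokens(method2_ids)
print(wrapped == method1_ids)
```

With a real tokenizer it is safer to read the ids from `tokenizer.cls_token_id` and `tokenizer.sep_token_id` than to hard-code them, since they vary by checkpoint.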

## Padding

- **Regardless of the padding id**, an attention mask is required so that the model does not attend to the padded positions

The attention mask must cover every padding token.  
  
Otherwise the Transformer attends to the padding tokens when it computes the context!
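
A small sketch of what the mask does inside attention, assuming a single query and made-up raw scores. Masked positions are set to negative infinity before the softmax, so they receive exactly zero weight. Real implementations typically add a large negative bias to a tensor instead; this pure-Python version only illustrates the idea and assumes at least one position is unmasked.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores that gives zero weight wherever mask == 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # assumes at least one unmasked position
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]  # made-up attention scores for one query
with_mask = masked_softmax(scores, [1, 1, 0])     # last position is padding
without_mask = masked_softmax(scores, [1, 1, 1])  # padding is attended to

print(with_mask)
print(without_mask)
```

Without the mask, the padded position takes the largest share of the attention weight in this example, which is how padding leaks into the computed context.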

### Method 1

In [33]:
sequences = ['I am hesitating over saying something......', 'I want to get off from work']
inputs = tokenizer(sequences, padding='longest')
inputs = tokenizer(sequences, padding='max_length') # pad up to the maximum sequence length the model accepts
inputs = tokenizer(sequences, padding='max_length', max_length=8)

### Method 2

In [24]:
sequence1_id = [[200, 200, 200]]
sequence2_id = [[200, 200]]
batch_ids = [[200, 200, 200], [200, 200, tokenizer.pad_token_id]]
# tokenizer.pad_token_id == 0
attention_mask = [[1,1,1],[1,1,0]]

model(torch.tensor(batch_ids), attention_mask=torch.tensor(attention_mask))

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
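
The hand-written `batch_ids` and `attention_mask` above can be produced by a small helper. This is a sketch assuming right-side padding; `pad_batch` is a hypothetical name, and `pad_id=0` matches the `tokenizer.pad_token_id` noted in the cell.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=0)
print(batch_ids)
print(attention_mask)
```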

## Truncate

Most Transformer models accept a maximum sequence length of 512 or 1024 tokens.  
  
Longer inputs have to be truncated.


In [27]:
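# naive truncation: keep only the first max_sequence_length items
# (max_sequence_length stands for whatever limit the model supports)
sequence = sequence[:max_sequence_length]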
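
A plain slice does not account for the special tokens. The sketch below, with a hypothetical helper `truncate_with_special_tokens`, reserves two slots for `[CLS]` and `[SEP]` so the total length stays within `max_length`. The ids are the ones these notes print for 'I wish to get off from work', and 101 / 102 are assumed to be the special-token ids of this checkpoint.

```python
def truncate_with_special_tokens(ids, max_length, cls_id=101, sep_id=102):
    """Cut the body so that [CLS] + body + [SEP] has at most max_length ids."""
    n_special = 2
    if max_length < n_special:
        raise ValueError("max_length must leave room for the two special tokens")
    body = list(ids)[: max_length - n_special]
    return [cls_id] + body + [sep_id]

ids = [1045, 4299, 2000, 2131, 2125, 2013, 2147]  # 'i wish to get off from work'
truncated = truncate_with_special_tokens(ids, max_length=8)
print(truncated)
```

The last word's id (2147) is dropped, which matches what `tokenizer(sequences, max_length=8, truncation=True)` returns for that sentence in the output of the Truncate section.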




### Method 1

In [35]:
sequences = ['I am hesitating over saying something......', 'I wish to get off from work']
inputs = tokenizer(sequences, truncation=True) # cut anything beyond the maximum sequence length the model accepts
inputs = tokenizer(sequences, max_length=8, truncation=True) # cut after 8 ids; the 8 include [CLS] and [SEP]
print(inputs)

inputs = tokenizer(sequences, max_length=8, truncation=True, return_tensors='pt') # pytorch tensor
inputs = tokenizer(sequences, max_length=8, truncation=True, return_tensors='tf') # tf tensor
inputs = tokenizer(sequences, max_length=8, truncation=True, return_tensors='np') # NumPy array

{'input_ids': [[101, 1045, 2572, 2002, 28032, 5844, 2058, 102], [101, 1045, 4299, 2000, 2131, 2125, 2013, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
