<a href="https://colab.research.google.com/github/stemgene/All-you-need-is-attention/blob/main/01_tokenizer_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.bilibili.com/video/BV1W34y157tA/?spm_id_from=333.999.0.0&vd_source=81884c519d60bbdad4b6fd87d340415f

In [24]:
import torch
import torch.nn.functional as F
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# 1 tokenizer: constructing the input

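* The tokenizer must match the model: the tokenizer's output is the model's input, so the two have to agree.
* `AutoTokenizer`, `AutoModel*`: classes prefixed with `Auto` are generic; given a checkpoint name, they resolve the matching tokenizer and the corresponding model.
* The tokenizer serves the model input
    * len(input_ids) == len(attention_mask)
    * `tokenizer("today is not that bad")` invokes `tokenizer.__call__`, which in turn runs the encoding step
    * `tokenizer.encode` is roughly equivalent to `tokenizer.tokenize` + `tokenizer.convert_tokens_to_ids`, except that the latter pair does not add the special tokens 101 (`[CLS]`) and 102 (`[SEP]`)
    * The tokenizer works from the `tokenizer.vocab` dictionary, which stores the token => id mapping
        * `tokenizer.special_tokens_map`
    * `attention_mask` goes hand in hand with padding: padded positions get an attention mask of 0, positions holding real tokens get 1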
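
The relationship between these calls can be illustrated with a toy re-implementation. This is only a sketch under stated assumptions: `TOY_VOCAB`, `toy_tokenize`, `toy_convert_tokens_to_ids`, `toy_encode`, and `toy_call` are hypothetical names, the vocabulary holds just the ids that appear in this notebook's outputs, and lowercasing plus whitespace splitting stands in for the real WordPiece tokenizer.

```python
# Toy vocabulary: only the ids that appear in this notebook's outputs.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103,
    "today": 2651, "is": 2003, "not": 2025, "that": 2008, "so": 2061, "bad": 2919,
}

def toy_tokenize(text):
    # Assumption: lowercase + whitespace split stands in for WordPiece.
    return text.lower().split()

def toy_convert_tokens_to_ids(tokens):
    # Tokens missing from the toy vocabulary fall back to the [UNK] id.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]

def toy_encode(text):
    # encode = tokenize + convert_tokens_to_ids, wrapped in [CLS] ... [SEP].
    ids = toy_convert_tokens_to_ids(toy_tokenize(text))
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

def toy_call(text):
    # Mimics the shape of tokenizer(text): ids plus a mask of equal length.
    input_ids = toy_encode(text)
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_call("today is not that bad")
print(encoded)
```

For a single unpadded sentence every position holds a real token, so the mask is all ones.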

In [2]:
pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


<transformers.pipelines.text_classification.TextClassificationPipeline at 0x785a2bd893f0>

`distilbert-base-uncased-finetuned-sst-2-english` is the default sentiment-analysis model in transformers; the pipeline falls back to it when no model is specified.

In [3]:
test_sentences = ['today is not that bad', 'today is so bad']
model_name = "distilbert-base-uncased-finetuned-sst-2-english" # default sentiment-analysis model in transformers

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Passing the same `model_name` to both `from_pretrained` calls keeps the tokenizer and the model consistent with each other.

In [9]:
batch_input = tokenizer(test_sentences, truncation=True, padding=True, return_tensors='pt')  # 'pt' requests PyTorch tensors
batch_input

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102],
        [ 101, 2651, 2003, 2061, 2919,  102,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

## A closer look at the tokenizer

In [5]:
tokenizer("today is not that bad")

{'input_ids': [101, 2651, 2003, 2025, 2008, 2919, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

The `tokenizer` call returns two outputs, `input_ids` and `attention_mask`, and they have the same length.

The tokenizer's `encode` method produces the `input_ids`:

In [6]:
tokenizer.encode("today is not that bad")

[101, 2651, 2003, 2025, 2008, 2919, 102]

Splitting the text into tokens with `tokenize`:

In [7]:
tokenizer.tokenize("today is not that bad")

['today', 'is', 'not', 'that', 'bad']

In [8]:
tokenizer.convert_tokens_to_ids(['today', 'is', 'not', 'that', 'bad'])

[2651, 2003, 2025, 2008, 2919]

Decoding:

In [11]:
tokenizer.decode( [101, 2651, 2003, 2025, 2008, 2919, 102])

'[CLS] today is not that bad [SEP]'

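The `[CLS]` and `[SEP]` tokens in the result above are the typical BERT-style special tokens.

`tokenizer.vocab` returns a dict of token: id pairs; encoding and decoding work by looking up tokens and ids in this dict.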

`tokenizer.vocab` = {'##gart': 27378,
 '72': 5824,
 'tod': 28681,
 'secrets': 7800,
 'issn': 23486,
 'ャ': 1728,...}
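
Decoding is the same lookup in the other direction. A minimal sketch, assuming a toy subset of the vocabulary built from ids shown in this notebook; `toy_vocab`, `id_to_token`, and `toy_decode` are hypothetical names, and joining with spaces ignores the merging of WordPiece `##` fragments that the real `decode` performs.

```python
# Toy subset of the vocabulary, using ids shown in this notebook.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "today": 2651, "is": 2003,
             "not": 2025, "that": 2008, "bad": 2919}

# Invert token -> id into id -> token for decoding.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def toy_decode(ids):
    # Assumption: a plain space join; "##" sub-word merging is not modeled.
    return " ".join(id_to_token[i] for i in ids)

decoded = toy_decode([101, 2651, 2003, 2025, 2008, 2919, 102])
print(decoded)
```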

In [15]:
len(tokenizer.vocab)

30522

In [17]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [18]:
tokenizer.special_tokens_map.values()

dict_values(['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'])

In [19]:
tokenizer.convert_tokens_to_ids([special for special in tokenizer.special_tokens_map.values()])

[100, 102, 0, 101, 103]

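### tokenizer parameters

* `max_length`: input sequences vary in length, but the model needs one fixed length per batch; shorter sequences are filled up with the pad token (id 0)
* `truncation`: inputs longer than `max_length` are truncated
* `padding`: pads sequences to a common length
* `return_tensors`: `pt` returns PyTorch tensors, `tf` returns TensorFlow tensors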

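The values of `max_length` and `padding` go together:
* `max_length=32` & `padding='max_length'`: when the former is set to a concrete int, the latter is set to `'max_length'`
* `padding=True`: when `max_length` is not set, use `padding=True` instead; the longest input in the batch determines the length, and every other sentence is zero-padded up to it.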
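
The two padding modes can be sketched on plain Python lists. `toy_pad` and `PAD_ID` are hypothetical names, not part of transformers; truncation is not modeled, and the mask is derived simply from each sequence's original length.

```python
PAD_ID = 0  # id of [PAD] in this vocabulary

def toy_pad(batch, max_length=None):
    # max_length=None mimics padding=True (pad to the longest sequence);
    # an int mimics padding='max_length' with that max_length.
    if max_length is None:
        target = max(len(ids) for ids in batch)
    else:
        target = max_length
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 for real tokens, 0 for padded positions.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2651, 2003, 2025, 2008, 2919, 102],
         [101, 2651, 2003, 2061, 2919, 102]]

longest = toy_pad(batch)               # like padding=True
fixed = toy_pad(batch, max_length=32)  # like max_length=32, padding='max_length'
print(longest)
print(len(fixed["input_ids"][0]))
```

Padding to the longest sequence keeps the tensors small, while a fixed `max_length` gives every batch the same shape.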

In [20]:
tokenizer(test_sentences, max_length=32, truncation=True, padding='max_length', return_tensors='pt')

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0],
        [ 101, 2651, 2003, 2061, 2919,  102,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])}

In [21]:
tokenizer(test_sentences, truncation=True, padding=True, return_tensors='pt')

{'input_ids': tensor([[ 101, 2651, 2003, 2025, 2008, 2919,  102],
        [ 101, 2651, 2003, 2061, 2919,  102,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}

# 2 model: calling the model

In [10]:
model(**batch_input)  # ** unpacks the dict into keyword arguments

SequenceClassifierOutput(loss=None, logits=tensor([[-3.4620,  3.6118],
        [ 4.7508, -3.7899]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
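
The `**` unpacking works because the keys of the tokenizer output match the keyword argument names the model expects. A small sketch with a stand-in function; `toy_forward` and `toy_batch` are hypothetical and only mirror the argument names `input_ids` and `attention_mask`.

```python
def toy_forward(input_ids=None, attention_mask=None):
    # Stand-in for the model's forward pass; it only reports what it received.
    return {"n_sequences": len(input_ids), "has_mask": attention_mask is not None}

toy_batch = {
    "input_ids": [[101, 2651, 2003, 2025, 2008, 2919, 102],
                  [101, 2651, 2003, 2061, 2919, 102, 0]],
    "attention_mask": [[1, 1, 1, 1, 1, 1, 1],
                       [1, 1, 1, 1, 1, 1, 0]],
}

# These two calls are equivalent.
unpacked = toy_forward(**toy_batch)
explicit = toy_forward(input_ids=toy_batch["input_ids"],
                       attention_mask=toy_batch["attention_mask"])
print(unpacked == explicit)
```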

In [31]:
model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.35.2",
  "vocab_size": 30522
}

"id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
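
The printed config shows string keys because JSON object keys are always strings, whereas the label lookup in the next section indexes `model.config.id2label` with Python ints. A minimal sketch of that key conversion using the standard `json` module; `raw`, `id2label`, and `predicted` are local illustration variables, not library objects.

```python
import json

# The id2label fragment as it appears in the serialized config.
raw = json.loads('{"id2label": {"0": "NEGATIVE", "1": "POSITIVE"}}')

# Convert the string keys to ints so predicted class indices can be used directly.
id2label = {int(k): v for k, v in raw["id2label"].items()}

predicted = [1, 0]  # class indices, e.g. from an argmax over the logits
labels = [id2label[i] for i in predicted]
print(labels)
```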

# 3 parse output

In [33]:
with torch.no_grad():
    outputs = model(**batch_input)
    print(outputs)   # `logits` are the raw scores from the final layer, before softmax
    scores = F.softmax(outputs.logits, dim=1)  # convert to probabilities; each row sums to 1
    print(scores)
    labels = torch.argmax(scores, dim=1)  # 1 is positive, 0 is negative
    print(labels)
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-3.4620,  3.6118],
        [ 4.7508, -3.7899]]), hidden_states=None, attentions=None)
tensor([[8.4632e-04, 9.9915e-01],
        [9.9980e-01, 1.9531e-04]])
tensor([1, 0])
['POSITIVE', 'NEGATIVE']


`logits` is the vector of raw, unnormalized scores that comes out of the network's final layer, just before softmax is applied.

In [27]:
8.4632e-04 + 9.9915e-01

0.9999963199999999
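
The sum above falls slightly short of 1 only because the printed probabilities are rounded to five significant digits. The softmax step can be reproduced by hand from the printed logits; this is a plain-Python sketch with the standard `math` module, where `softmax`, `pred_ids`, and the local `id2label` dict are illustration names rather than the PyTorch implementation.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

logits = [[-3.4620, 3.6118], [4.7508, -3.7899]]  # values printed above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

scores = [softmax(row) for row in logits]
pred_ids = [row.index(max(row)) for row in scores]  # argmax per row
labels = [id2label[i] for i in pred_ids]
print(scores)
print(labels)
```

Because the logits are themselves rounded to four decimals, the recomputed probabilities agree with the printed tensor only up to small rounding differences.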