# BERT Text Classification

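This section uses the transformers library to classify text with a pretrained model. On domain-specific data, a pretrained model generally performs better after it has been fine-tuned on that domain.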

In [1]:
import os

from transformers import pipeline

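# Works around the "duplicate OpenMP runtime" error that can occur on some setups.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
model_name = "hfl/chinese-macbert-base"

# Note: this checkpoint is a pretrained base model without a fine-tuned
# classification head, so the predictions below are not meaningful.
nlp = pipeline("sentiment-analysis",
               model=model_name,
               tokenizer=model_name,
               device=-1,  # -1 runs on CPU; pass a GPU device id (e.g. 0) to use a GPU
               )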

result = nlp("我爱你")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = nlp("我恨你")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not i

label: LABEL_1, with score: 0.5718
label: LABEL_1, with score: 0.6083
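
The generic `LABEL_1` name in the output comes from the model configuration: when a checkpoint defines no label names, transformers generates `LABEL_0`, `LABEL_1`, ... from `num_labels`. The sketch below illustrates this with a bare `BertConfig`, so no weights are downloaded. The `negative`/`positive` names are hypothetical and are not part of `hfl/chinese-macbert-base`.

```python
from transformers import BertConfig

# A config without explicit label names falls back to generic ones.
default_config = BertConfig(num_labels=2)
print(default_config.id2label)

# Hypothetical label names, chosen only for illustration.
named_config = BertConfig(
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)
print(named_config.id2label)
```

A fine-tuned checkpoint usually ships its own `id2label` mapping, in which case the pipeline reports those names instead of `LABEL_0`/`LABEL_1`.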


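By default, downloaded models are cached under `~/.cache/huggingface/transformers` (recent transformers releases use `~/.cache/huggingface/hub` instead).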

Model card for `hfl/chinese-macbert-base`:
https://huggingface.co/hfl/chinese-macbert-base?text=%E5%B7%B4%E9%BB%8E%E6%98%AF%5BMASK%5D%E5%9B%BD%E7%9A%84%E9%A6%96%E9%83%BD%E3%80%82

Alternatively, use a classification model that someone else has already fine-tuned; it will give better results.
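
As a minimal sketch of that suggestion, the helpers below build a pipeline from a fine-tuned checkpoint and format its output. The checkpoint name `uer/roberta-base-finetuned-jd-binary-chinese` is an assumption used for illustration (substitute any sequence-classification checkpoint fine-tuned for your task), and `build_classifier`, `format_prediction`, and `classify` are hypothetical helper names. Loading the model downloads it on first use.

```python
def build_classifier(checkpoint="uer/roberta-base-finetuned-jd-binary-chinese", device=-1):
    """Create a text-classification pipeline from a fine-tuned checkpoint.

    The default checkpoint name is an assumption; device=-1 runs on CPU.
    """
    # Imported here so that the formatting helpers work without loading transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=checkpoint,
                    tokenizer=checkpoint, device=device)


def format_prediction(prediction):
    """Render one pipeline result dict ({'label': ..., 'score': ...}) as text."""
    return f"label: {prediction['label']}, with score: {prediction['score']:.4f}"


def classify(texts, classifier):
    """Run the classifier on a list of texts; return one formatted line per text."""
    return [format_prediction(p) for p in classifier(texts)]
```

For example, `classify(["我爱你", "我恨你"], build_classifier())` returns one formatted line per sentence. The label names depend on the checkpoint that is loaded.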


Instead of using the pipeline, you can write the prediction logic yourself:

In [4]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
print("token ok")
model = AutoModelForSequenceClassification.from_pretrained(model_name)
print("model ok")


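# Illustrative class names: the classification head of this base checkpoint has
# not been fine-tuned on paraphrase data, so the probabilities are not meaningful.
classes = ["not paraphrase", "is paraphrase"]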
sequence_0 = "中国首都是北京"
sequence_1 = "苹果有益于你的身体健康"
sequence_2 = "北京是在北回归线附近的城市"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

# Inference only, so gradients are not needed.
with torch.no_grad():
    paraphrase_classification_logits = model(**paraphrase).logits
    not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Pair treated as the paraphrase case in this example
print("-"*42)
print(sequence_0 + ' + ' + sequence_2 + ', paraphrase proba:')
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
    
# Pair treated as the non-paraphrase case in this example
print("-"*42)
print(sequence_0 + ' + ' + sequence_1 + ', paraphrase proba:')
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

token ok


Some weights of the model checkpoint at hfl/chinese-macbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not i

model ok
------------------------------------------
中国首都是北京 + 北京是在北回归线附近的城市, paraphrase proba:
not paraphrase: 24%
is paraphrase: 76%
------------------------------------------
中国首都是北京 + 苹果有益于你的身体健康, paraphrase proba:
not paraphrase: 54%
is paraphrase: 46%
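
The percentages above come from applying softmax to the two logits the model returns. As a rough illustration in plain Python (the logit values are made up, not taken from the model), logits that are close together give probabilities near 50%, so an output such as 54% / 46% signals little confidence either way:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["not paraphrase", "is paraphrase"]
example_logits = [0.10, -0.05]  # hypothetical values, for illustration only

for name, prob in zip(classes, softmax(example_logits)):
    print(f"{name}: {int(round(prob * 100))}%")
```

This mirrors what `torch.softmax(logits, dim=1)` does for a single row of logits in the cell above.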


End of this section.