#### Text classification can also be applied to pairs of sentences
<div>
<img src="image/pair_of_sentence1.png" width=800/>
</div>

#### In the GLUE benchmark, 8 of the 10 tasks concern pairs of sentences
<div>
<img src="image/pair_of_sentence2.png" width=800/>
</div>

#### Models like BERT are pretrained to recognize relationships between two sentences. During pretraining, BERT is shown pairs of sentences and must predict both the values of the randomly masked tokens and whether the second sentence follows the first
<div><img src="image/pair_of_sentence3.png" width=800></div>
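The sentence-pair half of that pretraining objective can be illustrated with a small sketch. It is a simplified illustration, not BERT's actual data pipeline: the helper name `make_nsp_examples` and the toy sentences are made up for this example, the random sentence is drawn from the same short list rather than from a whole corpus, and token masking is left out entirely.

```python
import random


def make_nsp_examples(sentences, rng):
    """Build (first, second, is_next) triples from an ordered list of sentences.

    Roughly half of the pairs keep the true next sentence; the others swap in
    a different sentence picked at random from the same list.
    """
    examples = []
    for i in range(len(sentences) - 1):
        first, true_next = sentences[i], sentences[i + 1]
        if rng.random() < 0.5:
            examples.append((first, true_next, True))
        else:
            # Any sentence except the first one and its true continuation.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((first, rng.choice(candidates), False))
    return examples


# Hypothetical toy document, used only for illustration.
document = [
    "The cat sat on the mat.",
    "It purred quietly.",
    "Outside, it started to rain.",
    "The street emptied quickly.",
]
examples = make_nsp_examples(document, random.Random(0))
for first, second, is_next in examples:
    print(is_next, "|", first, "->", second)
```

The model only sees the two sentences; the `is_next` flag is the target it has to predict for the pair.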

#### An ***AutoTokenizer*** instance accepts sentence pairs as well as single sentences. The ***token_type_ids*** entry of the ***tokenizer*** output is a mask indicating which sentence each token belongs to
<div><img src="image/pair_of_sentence4.png" width=1000></div>

In [5]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
output = tokenizer("My name is Qiyao Xue", "I am a Chinese")
print(output)
print(output["token_type_ids"])

{'input_ids': [101, 2026, 2171, 2003, 18816, 3148, 2080, 15990, 2063, 102, 1045, 2572, 1037, 2822, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
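The layout of that output can be reproduced by hand, which makes the role of `token_type_ids` easier to see. The sketch below assumes the BERT convention `[CLS] A [SEP] B [SEP]`, with 101 and 102 as the IDs of `[CLS]` and `[SEP]` in `bert-base-uncased`. The helper `build_pair_inputs` is a hypothetical name for this illustration, not a function of the `transformers` library.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Combine two lists of token IDs as [CLS] A [SEP] B [SEP]."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}


# Token IDs copied from the tokenizer output above, without the special tokens.
ids_a = [2026, 2171, 2003, 18816, 3148, 2080, 15990, 2063]  # "My name is Qiyao Xue"
ids_b = [1045, 2572, 1037, 2822]                            # "I am a Chinese"

pair = build_pair_inputs(ids_a, ids_b)
print(pair["input_ids"])
print(pair["token_type_ids"])
```

Note that each `[SEP]` token belongs to the segment it closes, which is why the first segment has ten entries and the second has five.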


#### To process several pairs of sentences together, pass the list of first sentences as the first argument and the list of second sentences as the second

In [19]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
output = tokenizer(
    ["My name is Qiyao Xue.", "I am a Chinese."],
    ["This is a HuggingFace course.", "It is great."],
    padding=True
    )
print("output:", output)
print("input_ids:", output["input_ids"])
print("token_type_ids:", output["token_type_ids"])
print("attention_mask:", output["attention_mask"])

output: {'input_ids': [[101, 2026, 2171, 2003, 18816, 3148, 2080, 15990, 2063, 1012, 102, 2023, 2003, 1037, 17662, 12172, 2607, 1012, 102], [101, 1045, 2572, 1037, 2822, 1012, 102, 2009, 2003, 2307, 1012, 102, 0, 0, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]}
input_ids: [[101, 2026, 2171, 2003, 18816, 3148, 2080, 15990, 2063, 1012, 102, 2023, 2003, 1037, 17662, 12172, 2607, 1012, 102], [101, 1045, 2572, 1037, 2822, 1012, 102, 2009, 2003, 2307, 1012, 102, 0, 0, 0, 0, 0, 0, 0]]
token_type_ids: [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]]
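What `padding=True` did to the shorter pair can be sketched in plain Python. This is a minimal illustration, assuming right-padding to the longest sequence in the batch with a pad token ID of 0, which matches the output above. `pad_batch` is a hypothetical helper, not part of `transformers`.

```python
def pad_batch(encodings, pad_id=0):
    """Right-pad every encoding to the length of the longest one."""
    max_len = max(len(e["input_ids"]) for e in encodings)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for e in encodings:
        n_real = len(e["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(e["input_ids"] + [pad_id] * n_pad)
        # Padding positions get segment 0, as in the output above.
        batch["token_type_ids"].append(e["token_type_ids"] + [0] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Unpadded encodings of the two pairs, read off the output above.
pair_1 = {
    "input_ids": [101, 2026, 2171, 2003, 18816, 3148, 2080, 15990, 2063, 1012, 102,
                  2023, 2003, 1037, 17662, 12172, 2607, 1012, 102],
    "token_type_ids": [0] * 11 + [1] * 8,
}
pair_2 = {
    "input_ids": [101, 1045, 2572, 1037, 2822, 1012, 102, 2009, 2003, 2307, 1012, 102],
    "token_type_ids": [0] * 7 + [1] * 5,
}

batch = pad_batch([pair_1, pair_2])
print(batch["input_ids"][1])
print(batch["token_type_ids"][1])
print(batch["attention_mask"][1])
```

Because padded positions also carry a `token_type_ids` value of 0, only the `attention_mask` distinguishes padding from tokens of the first sentence.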




#### These inputs are then ready to be passed to a sequence classification model

In [26]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
batch = tokenizer(
    ["My name is Qiyao Xue.", "I am a Chinese."],
    ["This is a HuggingFace course.", "It is great."],
    padding=True,
    return_tensors="pt"
)

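# The checkpoint has no classification head, so one is randomly initialized here;
# the predictions are not meaningful until the model is fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**batch)
print(outputs)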

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=None, logits=tensor([[-0.2570, -0.7383],
        [-0.1203, -0.4321]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [28]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions.detach())

tensor([[0.6181, 0.3819],
        [0.5773, 0.4227]])
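To make the softmax step concrete, the same probabilities can be recomputed by hand from the logits printed earlier. This is a plain-Python sketch of the standard softmax formula, not what PyTorch runs internally. The logits are the rounded values from the output above, so the last digit may differ slightly.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied (rounded) from the model output above.
logits = [[-0.2570, -0.7383], [-0.1203, -0.4321]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row is normalized independently, which corresponds to `dim=-1` in the `torch.nn.functional.softmax` call.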


In [34]:
id2label = model.config.id2label
print(id2label)
# Map each pair's highest-probability class index to its label name.
labels = [id2label[row.argmax().item()] for row in predictions]
print(f"sentence pair 1: {labels[0]}, sentence pair 2: {labels[1]}")

{0: 'LABEL_0', 1: 'LABEL_1'}
sentence pair 1: LABEL_0, sentence pair 2: LABEL_0
