# Natural Language Processing with Transformers

> Natural language processing with Hugging Face Transformers

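## Natural Language Processing (NLP)

- Classifying whole sentences: rating reviews, detecting spam email, judging whether a sentence is grammatically correct, or judging whether two sentences are logically related
- Classifying each word in a sentence: identifying parts of speech (noun, verb, adjective) or named entities (person, location, organization)
- Generating text: completing an input prompt with automatically generated text, or filling in blanks in a sentence
- Extracting information from text: given a question and a context, extracting the answer to the question from the context
- Transforming text: translating a text into another language, or summarizing it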

The Transformers library from Hugging Face (https://huggingface.co/) supports a wide range of NLP tasks, including the following.

- sentiment-analysis
- zero-shot-classification
- text-generation
- fill-mask (filling in masked words)
- ner (named entity recognition)
- question-answering
- summarization
- translation

Basic usage is simple: pass one of the task names above to `pipeline`, then call the returned instance with a string.

In [None]:
from transformers import pipeline

## Sentiment analysis

Returns whether the given text is `POSITIVE` or `NEGATIVE`, together with a confidence score.

In [None]:
classifier = pipeline("sentiment-analysis")
classifier("We are very happy to show you the 🤗 Transformers library.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

## Zero-shot classification

Classifies the given text without any labeled examples. The list of labels to choose from is passed through the `candidate_labels` argument.

In [None]:
classifier2 = pipeline("zero-shot-classification")
classifier2(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445950150489807, 0.11197729408740997, 0.0434277318418026]}
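
The `labels` and `scores` lists in the result are aligned position by position and sorted by decreasing score, so the first label is the most likely one. The following minimal sketch works on a copy of the output shown above rather than calling the model again; the variable names `label_scores` and `best_label` are introduced here for illustration only.

```python
# Copy of the zero-shot classification output shown above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445950150489807, 0.11197729408740997, 0.0434277318418026],
}

# Pair each label with its score (the two lists share the same order).
label_scores = dict(zip(result["labels"], result["scores"]))

# Pick the label with the highest score.
best_label = max(label_scores, key=label_scores.get)
total_score = sum(result["scores"])

print(best_label, round(label_scores[best_label], 3))
print(round(total_score, 3))
```

In this output the scores add up to roughly 1, so they can be read as a probability distribution over the candidate labels.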

## Text generation

Continues the given text.

In [None]:
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to run a database with Nginx and PHP. We first take a look at how to run PHP and Nginx together. Then we will use an example MySQL database to create a database. In the same'}]

The `model` argument of `pipeline` selects the model to use.
Choose a suitable one from https://huggingface.co/models.

The maximum length in tokens (prompt included) can be set with `max_length`, and the number of generated texts with `num_return_sequences`.

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to make mistakes as well as avoid them all because they cost you money, and why it makes good money'},
 {'generated_text': 'In this course, we will teach you how to understand the best, most effective and most effective ways to perform the work of the American people. These'}]

## Fill-mask

Fills the `<mask>` position in the given text with a word to complete the sentence. The `top_k` argument sets how many candidate words are returned.

In [None]:
unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.1961977630853653,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052729532122612,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

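## Named entity recognition

Named entity recognition (NER) is the task of extracting persons (PER), locations (LOC), organizations (ORG), and similar entities from a text.

Setting the `grouped_entities` argument to `True` merges the tokens that belong to the same entity into a single entry in the output. In recent versions of the library, `aggregation_strategy="simple"` is the recommended replacement for this argument.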


In [None]:
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796021,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
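
The `start` and `end` values are character offsets into the input string, with `end` exclusive, so slicing the original text recovers each entity. The sketch below checks this against a copy of the output shown above (scores omitted for brevity); it does not call the model, and the list name `spans` is introduced here for illustration.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Copy of the NER output shown above, without the scores.
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]

# Slice the input with the reported offsets (end is exclusive).
spans = [text[e["start"]:e["end"]] for e in entities]

for entity, span in zip(entities, spans):
    print(entity["entity_group"], repr(span))
```

The same offset convention applies to the `start` and `end` fields returned by the question-answering pipeline.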

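## Question answering

Given a question as `question` and a text as `context`, returns the answer extracted from the context, a score, and the start and end character positions of the answer within the context.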

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6949763894081116, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

## Summarization

Returns a summary of the given text.

In [None]:
summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

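## Translation

Returns the translated text.
The language pair is selected by the task name passed to `pipeline`; a dedicated translation model can also be given through the `model` argument.
The example below uses the English-to-French task `translation_en_to_fr`.
(For translation into German, use `translation_en_to_de` instead.)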


In [None]:
translator = pipeline("translation_en_to_fr")
translator("This course is produced by Hugging Face.")

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'translation_text': 'Ce cours est produit par Hugging Face.'}]

## How it works in detail

Internally, `pipeline` breaks down into the following steps.

```
string => tokenizer => model => post-processing
```

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModel
from pprint import pprint
from transformers import AutoModelForSequenceClassification
import torch

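### Tokenizer

First, the input string must be split into tokens (words, subwords, punctuation, and so on), and each token replaced by an integer ID.
This is done with the `from_pretrained` method of the `AutoTokenizer` class.
Its argument is a model name from https://huggingface.co/models, stored here in `checkpoint`.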

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
pprint(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


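Calling the resulting `tokenizer` with a string (or a list of strings) returns a dictionary holding the numerical encoding.
Its keys are `input_ids`, a multidimensional array of the token IDs, and `attention_mask`, which marks the tokens the model should attend to (1) and the padding to ignore (0).

To get tensors that can be fed directly to the model, specify the deep learning framework with `return_tensors`.
PyTorch is used here, so the argument is `pt`.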



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
pprint(inputs)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]])}
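
The trailing zeros in the second row come from `padding=True`: the shorter sentence is padded on the right to the length of the longest one, and the attention mask is 0 at exactly those positions. The following pure-Python sketch reproduces that idea on the token IDs shown above. It is not the library's implementation; `pad_batch` is a hypothetical helper, and the padding ID 0 is taken from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to a common length and build attention masks.

    Hypothetical helper for illustration only; the real tokenizer does this
    internally when called with padding=True.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)       # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```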


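### Model

Next, create an instance of the model class.
Here we use the `from_pretrained` method of the `AutoModel` class.

The model created this way contains only the base Transformer, and its output is a multidimensional array (tensor) of features extracted from the input.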

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
pprint(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

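Unpacking the dictionary produced by the tokenizer and passing it to the model returns a PyTorch tensor. Its shape is (batch size, sequence length, hidden size).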

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [None]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

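Next, create a model that includes the classification layers needed for sentiment analysis,
using the `AutoModelForSequenceClassification` class.

The tensor stored in the `logits` attribute of the output holds the raw, unnormalized score of each class for each sentence.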

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model2 = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs2 = model2(**inputs)

In [None]:
outputs2

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Post-processing

The logits are converted to probabilities with the softmax function. These are the predictions.

In [None]:
predictions = torch.nn.functional.softmax(outputs2.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


The prediction for the first sentence is [0.0402, 0.9598], and for the second sentence [0.9995, 0.0005].
That is, the first sentence most likely belongs to class 1, and the second to class 0.
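
To make the softmax step concrete, the sketch below recomputes the probabilities from the logits printed earlier, using only the Python standard library. The `softmax` function here is a hand-written illustration, not the PyTorch implementation; subtracting the row maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import math


def softmax(row):
    """Softmax of one list of logits (illustrative, standard library only)."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput shown above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the values agree with the tensor printed by `torch.nn.functional.softmax` up to rounding.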

The labels used by the model are available in the `id2label` attribute of the model's configuration.

In [None]:
model2.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Therefore, the first sentence is classified as `POSITIVE` and the second as `NEGATIVE`.
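
This last step, choosing the most probable class and looking up its name, can be sketched in plain Python. The probabilities and the `id2label` mapping are copied from the outputs above; the result format loosely mirrors what the sentiment-analysis pipeline returns, but the code is an illustration under those assumptions, not the pipeline's actual implementation.

```python
# Mapping and probabilities copied from the outputs shown above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
predictions = [
    [4.0195e-02, 9.5981e-01],
    [9.9946e-01, 5.4418e-04],
]

results = []
for probs in predictions:
    # Index of the largest probability (argmax over the classes).
    best_id = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best_id], "score": probs[best_id]})

print(results)
```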