# Trying out spaCy

## What is spaCy?

An NLP library that runs part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and similar tasks at high speed.

[Documentation](https://spacy.io/)

In [1]:
import spacy

In [2]:
%load_ext autoreload
%autoreload 2

## Loading the model

Load an NLP pipeline. The English models are said to be the most accurate.

In [3]:
nlp = spacy.load("en_core_web_sm")  # load the small English pipeline

## Sample texts

In [4]:
texts = [
    "Logistic Regression is a model widely used for binary classification.",
    "Random Forest is an ensemble model of many decision trees.",
    "Transformer is a model based on the attention mechanism.",
    "GPT is a Transformer-based model that uses only the decoder block.",
]

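## POS tagging

Estimate each word's part of speech and its dependency relation.

[Token attributes](https://spacy.io/api/token#attributes)

- lemma: base form of the token
- POS: coarse-grained part of speech ([POS tags](https://universaldependencies.org/u/pos/))
- tag: fine-grained part of speech
- dep: syntactic dependency relation (dependency label)
- shape: orthographic shape of the token string (letters become x or X, digits become d, and a run of the same character is truncated after 4)
- is_alpha: whether the token consists only of alphabetic characters
- is_stop: whether the token is a stop word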
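
The shape rule is easier to see in code. Below is a minimal pure-Python sketch of the transform as described in the list above; `word_shape` is a name chosen here for illustration, and this is an approximation written for this note, not spaCy's own implementation.

```python
def word_shape(text: str) -> str:
    """Approximate the shape transform: letters -> x/X, digits -> d,
    and runs of the same shape character cut off after 4."""
    shape = []
    last = ""  # previous shape character
    run = 0    # how many times `last` has already repeated
    for ch in text:
        if ch.isalpha():
            shape_char = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            shape_char = "d"
        else:
            shape_char = ch  # punctuation and other symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most 4 consecutive identical shape characters
            shape.append(shape_char)
    return "".join(shape)


for word in ["Logistic", "is", "classification", "."]:
    print(word, "->", word_shape(word))
```

For the four words above this prints `Xxxxx`, `xx`, `xxxx`, and `.`, which matches the shape column in the output of the next cell.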

In [5]:
doc = nlp(texts[0])

print("text | lemma | POS (coarse) | tag (fine-grained) | dep (dependency label) | shape | is_alpha | is_stop")
for token in doc:
    print(
        f"{token.text} | {token.lemma_} | {token.pos_} | {token.tag_} | {token.dep_} | {token.shape_} | {token.is_alpha} | {token.is_stop}"
    )

text | lemma | POS (coarse) | tag (fine-grained) | dep (dependency label) | shape | is_alpha | is_stop
Logistic | Logistic | PROPN | NNP | compound | Xxxxx | True | False
Regression | Regression | PROPN | NNP | nsubj | Xxxxx | True | False
is | be | AUX | VBZ | ROOT | xx | True | True
a | a | DET | DT | det | x | True | True
model | model | NOUN | NN | attr | xxxx | True | False
widely | widely | ADV | RB | advmod | xxxx | True | False
used | use | VERB | VBN | acl | xxxx | True | True
for | for | ADP | IN | prep | xxx | True | True
binary | binary | ADJ | JJ | amod | xxxx | True | False
classification | classification | NOUN | NN | pobj | xxxx | True | False
. | . | PUNCT | . | punct | . | False | False


## Noun phrase extraction (built-in)

In [6]:
doc = nlp(texts[0])
print(doc.text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Logistic Regression is a model widely used for binary classification.
Logistic Regression
a model
binary classification


In [7]:
doc = nlp(texts[1])
print(doc.text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Random Forest is an ensemble model of many decision trees.
Random Forest
an ensemble model
many decision trees


In [8]:
doc = nlp(texts[2])
print(doc.text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Transformer is a model based on the attention mechanism.
Transformer
a model
the attention mechanism


In [9]:
doc = nlp(texts[3])
print(doc.text)

for chunk in doc.noun_chunks:
    print(chunk.text)

GPT is a Transformer-based model that uses only the decoder block.
GPT
a Transformer-based model
that
only the decoder block


## Noun phrase extraction (custom)

I want simpler noun phrases, so I write a custom helper function.
For example, articles such as "a" and "the" are not needed.

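- A token whose dep (dependency label) is `compound` modifies the word that follows it, so the two are joined. The labels used here:
    - nsubj: nominal subject
    - nsubjpass: subject of a passive clause
    - dobj: direct object
    - pobj: object of a preposition
    - attr: complement of a be verb
    - compound: modifier placed in front of a noun (merges the noun phrase into one unit)
    - amod: adjective modifying a noun
- Determiners are excluded
- `token.lefts` yields the token's dependency children that appear to its left
    - these are the words that modify the token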
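
As a point of comparison, determiners could also be dropped by post-processing the built-in `doc.noun_chunks`. This is a different, simpler technique than the dependency-based helper built in this section. The sketch below assumes only that each chunk is an iterable of tokens exposing `pos_` and `text_with_ws`, as a spaCy `Span` does; `simplify_chunk` is a hypothetical helper name.

```python
def simplify_chunk(chunk, drop_pos=frozenset({"DET"})):
    """Return the chunk text without tokens whose coarse POS is in drop_pos.

    `chunk` is any iterable of token-like objects with `pos_` and
    `text_with_ws` attributes, e.g. a Span from doc.noun_chunks.
    Returns None when nothing but pronouns (such as "that") is left.
    """
    kept = [tok for tok in chunk if tok.pos_ not in drop_pos]
    if not kept or all(tok.pos_ == "PRON" for tok in kept):
        return None
    # text_with_ws preserves the original spacing, so hyphenated words
    # like "Transformer-based" are not split apart.
    return "".join(tok.text_with_ws for tok in kept).strip()
```

With the pipeline loaded, it could be applied as `[simplify_chunk(c) for c in doc.noun_chunks]`, filtering out the `None` entries afterwards. Unlike the custom helper, this keeps every other word of the chunk, including adverbs such as "only".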

The logic, summarized:

1.	For each token in the sentence:
    - if token.dep_ is in {"nsubj", "nsubjpass", "dobj", "pobj", "attr"}
    - and token.pos_ is in {"NOUN", "PROPN"}, treat it as a "noun phrase head"
2.	For that head token, from token.lefts:
    - collect every child whose dep_ is in {"compound", "amod"}
3.	Sort them together with the head token by .i (position) and join their lemmas
    - → this yields noun phrases such as "Logistic Regression" and "binary classification"


In [10]:
def get_noun_chunks(doc):
    HEAD_DEPS = {"nsubj", "nsubjpass", "dobj", "pobj", "attr"}
    NOUN_POS = {"NOUN", "PROPN"}
    MODIFIER_DEPS = {"compound", "amod"}

    chunks = {}
    for token in doc:
        if (token.dep_ in HEAD_DEPS) and (token.pos_ in NOUN_POS):
            modifiers = [child for child in token.lefts if child.dep_ in MODIFIER_DEPS]
            noun_parts = sorted(modifiers + [token], key=lambda t: t.i)
            phrase = " ".join(tok.lemma_ for tok in noun_parts)
            chunks[token] = phrase
    return chunks

In [11]:
doc = nlp(texts[0])
get_noun_chunks(doc)

{Regression: 'Logistic Regression',
 model: 'model',
 classification: 'binary classification'}

## Extracting SVO (single sentence)

Extract (subject, verb, object) triples.

In [12]:
import itertools
import sys
sys.path.insert(0, '../')

from src.spacy_helper import get_noun_phrase_map

In [13]:
doc = nlp(texts[0])
print(f"target sentence: {doc.text}")

noun_map = get_noun_phrase_map(doc=doc)

target sentence: Logistic Regression is a model widely used for binary classification.


In [14]:
VERB_POS = {"VERB", "AUX"}
VERB_DEP = {"ROOT", "conj"}

svos = []
for token in doc:
    # First, find the verbs
    if (token.dep_ in VERB_DEP) and (token.pos_ in VERB_POS):
        print(f"verb : {token.text}, {token.lemma_}")
        # Collect subject (S) and object (O) candidates
        subjects = [child for child in token.children if child.dep_ in {"nsubj", "nsubjpass"}]
        objects = [child for child in token.children if child.dep_ in {"dobj", "attr"}]
        print(subjects)
        print(objects)
        # Look up phrases in the noun phrase map
        subject_phrase = [noun_map.get(tok, tok.lemma_) for tok in subjects]
        object_phrase = [noun_map.get(tok, tok.lemma_) for tok in objects]
        # Build SVO triples from every S/O combination
        trps = [{"subject": s, "verb":token.lemma_, "object": o} for s, o in itertools.product(subject_phrase, object_phrase)]
        svos.extend(trps)

for svo in svos:
    print(svo)
    print()

verb : is, be
[Regression]
[model]
{'subject': 'Logistic Regression', 'verb': 'be', 'object': 'model'}
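
The loop pairs every subject phrase with every object phrase through `itertools.product`. With one subject and one object that gives a single triple. The stand-alone sketch below shows the general behavior on hypothetical phrase lists; `make_triples` is a name introduced here for illustration and is not part of the notebook's helpers.

```python
import itertools


def make_triples(subjects, verb, objects):
    """Pair every subject phrase with every object phrase for one verb."""
    return [
        {"subject": s, "verb": verb, "object": o}
        for s, o in itertools.product(subjects, objects)
    ]


# Hypothetical phrase lists: two subjects and one object give two triples.
triples = make_triples(["logistic regression", "random forest"], "be", ["model"])
for t in triples:
    print(t)

# An empty subject or object list yields no triples at all.
print(make_triples(["model"], "use", []))  # prints []
```

The second call shows why a verb without a `dobj` or `attr` child contributes nothing in this loop: the product over an empty list is empty.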



In [15]:
from src.spacy_helper import get_svo_from_sentence

In [16]:
svos = get_svo_from_sentence(doc, noun_map)
svos

[{'subject': 'Logistic Regression', 'verb': 'be', 'object': 'model'},
 {'subject': 'model', 'verb': 'use_for', 'object': 'binary classification'}]

## Extracting SVO (extended with prepositions)

In [17]:
# doc = nlp("The model is used for binary classification.")
doc = nlp(texts[0])
noun_map = get_noun_phrase_map(doc=doc)
print(f"target sentence: {doc.text}")

NOUN_POS = {"NOUN", "PROPN"}
VERB_POS = {"VERB", "AUX"}
VERB_DEP = {"ROOT", "conj", "acl"}

svos = []
for token in doc:
    # First, find the verbs
    if (token.dep_ in VERB_DEP) and (token.pos_ in VERB_POS):
        print(f"verb : {token.text}, {token.lemma_}")
        # Collect subject (S) and object (O) candidates
        subjects = [child for child in token.children if child.dep_ in {"nsubj", "nsubjpass"}]
        if not subjects and token.dep_ == "acl":  # for an acl verb, use its head noun as the subject candidate
            head = token.head
            if head.pos_ in NOUN_POS:
                subjects = [head]
        objects = [child for child in token.children if child.dep_ in {"dobj", "attr"}]
        preps = [child for child in token.children if child.dep_ == "prep"]
        print("subjects: ", subjects)
        print("objects: ", objects)
        print("preps: ", preps)
        # Look up phrases in the noun phrase map
        subject_phrase = [noun_map.get(tok, tok.lemma_) for tok in subjects]
        object_phrase = [noun_map.get(tok, tok.lemma_) for tok in objects]
        # Build SVO triples from every S/O combination
        trps = [{"subject": s, "verb":token.lemma_, "object": o} for s, o in itertools.product(subject_phrase, object_phrase)]
        svos.extend(trps)
        # Extract verb_preposition triples from prepositional objects
        for prep in preps:
            verb_prep = f"{token.lemma_}_{prep.text}"
            pobj_tokens = [c for c in prep.children if c.dep_ == "pobj"]
            pp_objects = [noun_map.get(tok, tok.lemma_) for tok in pobj_tokens]
            for s in subject_phrase:
                for o in pp_objects:
                    svos.append({"subject": s, "verb": verb_prep, "object": o})

for svo in svos:
    print(svo)
    print()

target sentence: Logistic Regression is a model widely used for binary classification.
verb : is, be
subjects:  [Regression]
objects:  [model]
preps:  []
verb : used, use
subjects:  [model]
objects:  []
preps:  [for]
{'subject': 'Logistic Regression', 'verb': 'be', 'object': 'model'}

{'subject': 'model', 'verb': 'use_for', 'object': 'binary classification'}
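
The loop above can be wrapped into a function. The source of `src.spacy_helper.get_svo_from_sentence` is not shown in this notebook, so the following is only a hypothetical sketch of such a wrapper; `extract_svo` and the constant names are assumptions made here. It relies only on the token attributes already used above (`dep_`, `pos_`, `lemma_`, `text`, `children`, `head`).

```python
import itertools

NOUN_POS = {"NOUN", "PROPN"}
VERB_POS = {"VERB", "AUX"}
VERB_DEP = {"ROOT", "conj", "acl"}
SUBJ_DEPS = {"nsubj", "nsubjpass"}
OBJ_DEPS = {"dobj", "attr"}


def extract_svo(doc, noun_map):
    """Hypothetical function version of the loop above.

    `doc` is an iterable of spaCy-like tokens; `noun_map` maps a head token
    to its noun phrase (the token's lemma is used when it is missing).
    """
    def phrase(tok):
        return noun_map.get(tok, tok.lemma_)

    svos = []
    for token in doc:
        if token.dep_ not in VERB_DEP or token.pos_ not in VERB_POS:
            continue
        children = list(token.children)
        subjects = [c for c in children if c.dep_ in SUBJ_DEPS]
        # An acl verb has no subject child, so use the noun it modifies.
        if not subjects and token.dep_ == "acl" and token.head.pos_ in NOUN_POS:
            subjects = [token.head]
        subj_phrases = [phrase(t) for t in subjects]
        obj_phrases = [phrase(c) for c in children if c.dep_ in OBJ_DEPS]

        # Plain subject-verb-object triples.
        for s, o in itertools.product(subj_phrases, obj_phrases):
            svos.append({"subject": s, "verb": token.lemma_, "object": o})

        # Verb + preposition triples, e.g. "use_for".
        for prep in (c for c in children if c.dep_ == "prep"):
            pobjs = [phrase(c) for c in prep.children if c.dep_ == "pobj"]
            verb_prep = f"{token.lemma_}_{prep.text}"
            for s, o in itertools.product(subj_phrases, pobjs):
                svos.append({"subject": s, "verb": verb_prep, "object": o})
    return svos
```

Called as `extract_svo(doc, noun_map)` with the objects from the cell above, it should give the same triples as the loop, since it follows the same steps without the debug prints.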



In [18]:
from src.spacy_helper import get_svo_from_sentence

In [19]:
# The loop above, wrapped into a function
doc = nlp("The model is used for binary classification.")
print(f"target sentence: {doc.text}")
noun_map = get_noun_phrase_map(doc=doc)

svos = get_svo_from_sentence(doc, noun_map)

for svo in svos:
    print(svo)
    print()

target sentence: The model is used for binary classification.
{'subject': 'model', 'verb': 'use_for', 'object': 'binary classification'}



In [20]:
doc = nlp(texts[0])
print(f"target sentence: {doc.text}")
noun_map = get_noun_phrase_map(doc=doc)

svos = get_svo_from_sentence(doc, noun_map)

for svo in svos:
    print(svo)
    print()

target sentence: Logistic Regression is a model widely used for binary classification.
{'subject': 'Logistic Regression', 'verb': 'be', 'object': 'model'}

{'subject': 'model', 'verb': 'use_for', 'object': 'binary classification'}

