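spaCy's parser component can be trained to predict any type of tree structure over your input text, including semantic relations that are not syntactic dependencies. This is useful for conversational applications, which need to predict trees over whole documents or chat logs, with the connections between sentence roots used to annotate discourse structure. For example, you can train spaCy's parser to label intents and their targets, such as attributes, quality, time and locations.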

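### Step-by-step guide
* Create the training data, consisting of the words, their heads and their dependency labels, in order. A token's head is the index of the token it is attached to. The heads don't need to be syntactically correct; they should express the semantic relations you want the parser to learn. For words that shouldn't receive a label, you can choose an arbitrary placeholder, for example `-`.
* Load the model you want to start with, or create an empty model using `spacy.blank` with the ID of your language. If you're using a blank model, don't forget to add the custom parser to the pipeline. If you're using an existing model, make sure to remove the old parser from the pipeline, and disable all other pipeline components during training using `nlp.disable_pipes`. This way, you'll only be training the parser.
* Add the dependency labels to the parser using the `add_label` method.
* Shuffle and loop over the examples. For each example, update the model by calling `nlp.update`, which steps through the words of the input. At each word, it makes a prediction. It then consults the annotations to see whether it was right. If it was wrong, it adjusts its weights so that the correct action will score higher next time.
* Save the trained model using `nlp.to_disk`.
* Test the model to make sure the parser works as expected.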
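
The update described in the list above (predict, check against the annotation, adjust the weights on a mistake) can be illustrated with a toy perceptron-style rule. This is only a sketch of the idea and not how spaCy implements it: the real parser is a neural transition-based model trained through `nlp.update`. The action names, the feature strings and the helpers `score`, `predict` and `update` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical action inventory and weight table. spaCy's real parser uses a
# neural model; this is only a perceptron-style illustration of the idea.
ACTIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]
weights = {action: defaultdict(float) for action in ACTIONS}


def score(action, features):
    """Sum the weights of the active features for one action."""
    return sum(weights[action][f] for f in features)


def predict(features):
    """Pick the highest-scoring action (ties go to the first in ACTIONS)."""
    return max(ACTIONS, key=lambda action: score(action, features))


def update(features, gold_action):
    """On a wrong guess, reward the gold action and penalise the guess."""
    guess = predict(features)
    if guess != gold_action:
        for f in features:
            weights[gold_action][f] += 1.0
            weights[guess][f] -= 1.0
    return guess


# Made-up features describing one parser state.
features = ["stack_top=find", "buffer_first=cafe"]
first_guess = update(features, "RIGHT-ARC")  # wrong at first, so weights change
second_guess = predict(features)  # the gold action now scores highest
print(first_guess, "->", second_guess)
```

After a single mistake the gold action outscores the earlier guess for the same features, which is the "correct action will score higher next time" behaviour in miniature.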

In [1]:
# training data: texts, heads and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
    (
        "find a cafe with great wifi",
        {
            "heads": [0, 2, 0, 5, 5, 2],  # index of token head
            "deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
        },
    ),
    (
        "find a hotel near the beach",
        {
            "heads": [0, 2, 0, 5, 5, 2],
            "deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
        },
    ),
    (
        "find me the closest gym that's open late",
        {
            "heads": [0, 0, 4, 4, 0, 6, 4, 6, 6],
            "deps": [
                "ROOT",
                "-",
                "-",
                "QUALITY",
                "PLACE",
                "-",
                "-",
                "ATTRIBUTE",
                "TIME",
            ],
        },
    ),
    (
        "show me the cheapest store that sells flowers",
        {
            "heads": [0, 0, 4, 4, 0, 4, 4, 4],  # attach "flowers" to store!
            "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"],
        },
    ),
    (
        "find a nice restaurant in london",
        {
            "heads": [0, 3, 3, 0, 3, 3],
            "deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
        },
    ),
    (
        "show me the coolest hostel in berlin",
        {
            "heads": [0, 0, 4, 4, 0, 4, 4],
            "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"],
        },
    ),
    (
        "find a good italian restaurant near work",
        {
            "heads": [0, 4, 4, 4, 0, 4, 5],
            "deps": [
                "ROOT",
                "-",
                "QUALITY",
                "ATTRIBUTE",
                "PLACE",
                "ATTRIBUTE",
                "LOCATION",
            ],
        },
    ),
]
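
Because each head is an index into the token list, annotation slips are easy to make. Below is a minimal sanity check, assuming you supply the tokens yourself. The `check_example` helper is hypothetical and does not call spaCy's tokenizer, which is what decides the real token count (the nine heads given for the "that's" sentence imply that it is split into two tokens). The single-root check matches the examples in this notebook but is an assumption, not a requirement of the parser.

```python
def check_example(tokens, heads, deps):
    """Return a list of problems found in one annotated example."""
    problems = []
    if not (len(tokens) == len(heads) == len(deps)):
        problems.append(
            f"length mismatch: {len(tokens)} tokens, "
            f"{len(heads)} heads, {len(deps)} deps"
        )
        return problems
    for i, head in enumerate(heads):
        if not 0 <= head < len(tokens):
            problems.append(f"token {i} ({tokens[i]!r}) has out-of-range head {head}")
    # Assumption: the root is the single token whose head is its own index.
    roots = [i for i, head in enumerate(heads) if head == i]
    if len(roots) != 1:
        problems.append(f"expected exactly one self-headed root, found {len(roots)}")
    return problems


tokens = ["find", "a", "cafe", "with", "great", "wifi"]
heads = [0, 2, 0, 5, 5, 2]
deps = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]

ok = check_example(tokens, heads, deps)
bad = check_example(tokens, [0, 2, 0, 5, 5, 9], deps)  # 9 is not a valid index
print(ok)
print(bad)

# Show each labelled token together with the token it attaches to.
relations = [
    (tokens[i], deps[i], tokens[heads[i]])
    for i in range(len(tokens))
    if deps[i] != "-"
]
print(relations)
```

Reading the relations back as (token, label, head) triples makes it easier to spot an attachment that does not express the intended meaning, for example "wifi" attaching to "cafe" as its `ATTRIBUTE`.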

In [2]:
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

In [4]:
nlp = spacy.blank("en")  # create blank Language class

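# We'll use the built-in dependency parser class, but we want to create a
# fresh instance, just in case. Note: create_pipe() followed by
# add_pipe(component_object) is the spaCy v2 way of adding a component.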
if "parser" in nlp.pipe_names:
    nlp.remove_pipe("parser")
parser = nlp.create_pipe("parser")
nlp.add_pipe(parser, first=True)

for text, annotations in TRAIN_DATA:
    for dep in annotations.get("deps", []):
        parser.add_label(dep)
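
To see which labels a loop like the one above registers, the distinct labels can be collected first. This is a small sketch using two of the training examples; the `collect_labels` name is hypothetical, and only `parser.add_label` comes from the notebook itself.

```python
def collect_labels(train_data):
    """Return the sorted set of dependency labels used in the annotations."""
    return sorted({dep for _, ann in train_data for dep in ann.get("deps", [])})


sample = [
    (
        "find a cafe with great wifi",
        {
            "heads": [0, 2, 0, 5, 5, 2],
            "deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
        },
    ),
    (
        "find a nice restaurant in london",
        {
            "heads": [0, 3, 3, 0, 3, 3],
            "deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"],
        },
    ),
]

labels = collect_labels(sample)
print(labels)
```

The placeholder `-` shows up in the result, so it is registered and learned like any other label.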

In [7]:
output_dir = None  # set to a Path before saving the model
n_iter = 15  # number of training iterations

pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):  # only train parser
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print("Losses", losses)

Losses {'parser': 26.828265368938446}
Losses {'parser': 30.450553886592388}
Losses {'parser': 28.348180279135704}
Losses {'parser': 24.870065785944462}
Losses {'parser': 27.60044641047716}
Losses {'parser': 26.754187911748886}
Losses {'parser': 27.736614488065243}
Losses {'parser': 30.705915331840515}
Losses {'parser': 26.23321044445038}
Losses {'parser': 21.00887693464756}
Losses {'parser': 16.1575785279274}
Losses {'parser': 15.251114595215768}
Losses {'parser': 9.836891966639087}
Losses {'parser': 7.147974300431088}
Losses {'parser': 4.645134863697422}
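
The `compounding(4.0, 32.0, 1.001)` call in the training loop supplies a slowly growing batch size. The pure-Python sketch below re-creates the idea so it can be run without spaCy. `compounding_sketch` and `minibatch_sketch` are stand-ins written for illustration; they are assumed to mirror the behaviour of `spacy.util.compounding` and `spacy.util.minibatch` (for the case where the start value is below the cap), not to reproduce their source.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into batches, drawing each batch size from the sizes iterator."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            break
        yield batch


examples = list(range(7))  # stand-in for the 7 training examples
batch_lengths = [
    len(batch)
    for batch in minibatch_sketch(examples, compounding_sketch(4.0, 32.0, 1.001))
]
print(batch_lengths)

# The first few batch sizes the schedule produces.
first_sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3))
print(first_sizes)
```

Under these assumptions, seven examples split into one batch of four and one of three in every iteration, and because the schedule is re-created inside the loop, the growth factor of 1.001 has no visible effect on a dataset this small.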


In [16]:
def test_model(nlp):
    texts = [
        "find a hotel with good wifi",
        "find me the cheapest gym near work",
        "show me the best hotel in berlin",
        "you do not know me",  # not a search request, to see how the parser reacts
        "show me the hotel in berlin",
    ]
    docs = nlp.pipe(texts)
    for doc in docs:
        print(doc.text)
        print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])

test_model(nlp)

find a hotel with good wifi
[('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'), ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
find me the cheapest gym near work
[('find', 'ROOT', 'find'), ('cheapest', 'QUALITY', 'gym'), ('gym', 'PLACE', 'me'), ('near', 'ATTRIBUTE', 'gym'), ('work', 'LOCATION', 'near')]
show me the best hotel in berlin
[('show', 'ROOT', 'show'), ('best', 'QUALITY', 'hotel'), ('hotel', 'PLACE', 'show'), ('berlin', 'LOCATION', 'hotel')]
you do not know me
[('you', 'ROOT', 'you'), ('not', 'QUALITY', 'know'), ('know', 'PLACE', 'do')]
show me the hotel in berlin
[('show', 'ROOT', 'show'), ('hotel', 'PLACE', 'show'), ('berlin', 'LOCATION', 'hotel')]
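
For an application, the printed triples may be more convenient as a small structure keyed by label. This is a minimal sketch that uses the first output line above as literal input. The `to_slots` helper and the `"intent"` key are hypothetical, and the sketch assumes each label occurs at most once per sentence (a later occurrence would overwrite an earlier one).

```python
def to_slots(triples):
    """Turn (text, label, head_text) triples into a dict keyed by label."""
    slots = {}
    for text, dep, head in triples:
        if dep == "ROOT":
            # Treat the root token as the intent of the request.
            slots["intent"] = text
        else:
            slots[dep] = {"text": text, "head": head}
    return slots


# First output line of test_model, copied as data.
triples = [
    ("find", "ROOT", "find"),
    ("hotel", "PLACE", "find"),
    ("good", "QUALITY", "wifi"),
    ("wifi", "ATTRIBUTE", "hotel"),
]
slots = to_slots(triples)
print(slots)
```

With a parsed `doc`, the same triples can be built with the list comprehension already used in `test_model`.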


In [9]:
output_dir = Path("models")
# Create the directory if it does not exist yet.
output_dir.mkdir(parents=True, exist_ok=True)
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to models


In [10]:
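# test the saved model (the recorded output below lists only three sentences,
# presumably because this cell was run before two more were added to test_model)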
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
test_model(nlp2)

Loading from models
find a hotel with good wifi
[('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'), ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
find me the cheapest gym near work
[('find', 'ROOT', 'find'), ('cheapest', 'QUALITY', 'gym'), ('gym', 'PLACE', 'me'), ('near', 'ATTRIBUTE', 'gym'), ('work', 'LOCATION', 'near')]
show me the best hotel in berlin
[('show', 'ROOT', 'show'), ('best', 'QUALITY', 'hotel'), ('hotel', 'PLACE', 'show'), ('berlin', 'LOCATION', 'hotel')]
