<h2 align="center">Click the badges below to run HanLP online</h2>
<div align="center">
	<a href="https://colab.research.google.com/github/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/con_stl.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
	<a href="https://mybinder.org/v2/gh/hankcs/HanLP/doc-zh?filepath=plugins%2Fhanlp_demo%2Fhanlp_demo%2Fzh%2Fcon_stl.ipynb" target="_blank"><img src="https://mybinder.org/badge_logo.svg" alt="Open In Binder"/></a>
</div>

## Installation

On Windows, Linux, or macOS, HanLP installs with a single command:

In [None]:
!pip install hanlp -U

## Loading a model
The HanLP workflow starts with loading a model. Model identifiers are stored in the `hanlp.pretrained` package, grouped by NLP task.

In [1]:
import hanlp
hanlp.pretrained.constituency.ALL # the language is given by the last field of the name or by the corresponding corpus

{'CTB9_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/constituency/ctb9_con_electra_small_20210807_161112.zip',
 'CTB9_FULL_TAG_ELECTRA_SMALL': 'https://file.hankcs.com/hanlp/constituency/ctb9_full_tag_con_electra_small_20220118_103119.zip'}

Call `hanlp.load` to load a model; it is downloaded automatically to a local cache.

In [4]:
con = hanlp.load('CTB9_FULL_TAG_ELECTRA_SMALL')

## Constituency parsing
The input is one or more sentences that have already been tokenized:

In [5]:
trees = con([["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"], ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]], tasks='con')

The return value is a list of `Tree` objects:

In [6]:
print(trees)

[['TOP', [['IP', [['NP-TMP', [['_', ['2021年']]]], ['NP-PN-SBJ', [['_', ['HanLPv2.1']]]], ['VP', [['PP-BNF', [['_', ['为']], ['NP', [['_', ['生产']], ['_', ['环境']]]]]], ['VP', [['_', ['带来']], ['NP-OBJ', [['CP', [['CP', [['IP', [['VP', [['NP', [['DP', [['_', ['次']]]], ['NP', [['_', ['世代']]]]]], ['ADVP', [['_', ['最']]]], ['VP', [['_', ['先进']]]]]]]], ['_', ['的']]]]]], ['NP', [['QP', [['_', ['多']]]], ['NP', [['_', ['语种']]]]]], ['NP', [['_', ['NLP']], ['_', ['技术']]]]]]]]]], ['_', ['。']]]]]], ['TOP', [['IP', [['NP-SBJ', [['_', ['阿婆主']]]], ['VP', [['VP', [['_', ['来到']], ['NP-OBJ', [['_', ['北京']], ['NP-PN', [['_', ['立方庭']]]]]]]], ['VP', [['_', ['参观']], ['NP-OBJ', [['_', ['自然']], ['_', ['语义']], ['_', ['科技']], ['_', ['公司']]]]]]]], ['_', ['。']]]]]]]


Printing a single tree renders it in bracketed format:

In [7]:
print(trees[0])

(TOP
  (IP
    (NP-TMP (_ 2021年))
    (NP-PN-SBJ (_ HanLPv2.1))
    (VP
      (PP-BNF (_ 为) (NP (_ 生产) (_ 环境)))
      (VP
        (_ 带来)
        (NP-OBJ
          (CP
            (CP
              (IP
                (VP
                  (NP (DP (_ 次)) (NP (_ 世代)))
                  (ADVP (_ 最))
                  (VP (_ 先进))))
              (_ 的)))
          (NP (QP (_ 多)) (NP (_ 语种)))
          (NP (_ NLP) (_ 技术)))))
    (_ 。)))
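
The relationship between the nested-list output and the bracketed output can be illustrated without loading a model. The following is a minimal sketch in plain Python, not HanLP's own implementation: it assumes each node is a `[label, children]` pair as in the printed list above, and the helper name `to_bracketed` and the shortened sample tree are hypothetical.

```python
def to_bracketed(node):
    """Render a [label, children] nested list as a one-line bracketed string."""
    if isinstance(node, str):
        # A plain string is a token (leaf).
        return node
    label, children = node
    return "(" + label + " " + " ".join(to_bracketed(c) for c in children) + ")"


# Hypothetical, shortened tree in the same nested-list shape as the output above.
sample = ['TOP', [
    ['IP', [
        ['NP-SBJ', [['_', ['阿婆主']]]],
        ['VP', [
            ['_', ['来到']],
            ['NP-OBJ', [['_', ['北京']]]],
        ]],
    ]],
]]

bracketed = to_bracketed(sample)
print(bracketed)
# (TOP (IP (NP-SBJ (_ 阿婆主)) (VP (_ 来到) (NP-OBJ (_ 北京)))))
```

Unlike the output above, this sketch writes everything on one line; the indentation that `print(trees[0])` produces is purely cosmetic.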


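## Assembling a pipeline

The first layer of non-terminals above the tokens in a constituency tree normally consists of part-of-speech tags, so constituency parsing is often used together with part-of-speech tagging. To that end, first load a part-of-speech tagger: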

In [8]:
pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)

Then define a function that merges the part-of-speech tags into the parse trees:

In [17]:
from hanlp_common.document import Document
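def merge_pos_into_con(doc: Document):
    # A single sentence has a flat list of tags; a batch has one list per sentence.
    flat = isinstance(doc['pos'][0], str)
    if flat:
        # Wrap every field in a list so a single sentence is handled as a batch of one.
        doc = Document((k, [v]) for k, v in doc.items())
    for tree, tags in zip(doc['con'], doc['pos']):
        offset = 0
        # Subtrees of height 2 are the pre-terminals: one per token, in sentence order.
        for subtree in tree.subtrees(lambda t: t.height() == 2):
            tag = subtree.label()
            # Only the placeholder label '_' is overwritten; any other label is kept.
            if tag == '_':
                subtree.set_label(tags[offset])
            offset += 1
    if flat:
        # Unwrap the batch of one back into a single-sentence document.
        doc = doc.squeeze()
    return doc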
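
To see the merging logic in isolation, here is a minimal sketch that applies the same idea to nested lists of the form `[label, children]` instead of `Tree` objects. It is an illustration under stated assumptions rather than HanLP code: the function name `fill_preterminals` and the sample tree and tags are hypothetical.

```python
def fill_preterminals(node, tags):
    """Return a copy of the tree with '_' pre-terminal labels replaced by tags."""
    tag_iter = iter(tags)

    def visit(n):
        label, children = n
        if all(isinstance(c, str) for c in children):
            # Pre-terminal: consume one tag per token, even if the label is kept.
            tag = next(tag_iter)
            return [tag if label == '_' else label, list(children)]
        return [label, [visit(c) for c in children]]

    return visit(node)


# Hypothetical sample: two tokens with placeholder labels and their tags.
sample = ['IP', [['NP', [['_', ['HanLP']]]], ['VP', [['_', ['works']]]]]]
merged = fill_preterminals(sample, ['NR', 'VV'])
print(merged)
# ['IP', [['NP', [['NR', ['HanLP']]]], ['VP', [['VV', ['works']]]]]]
```

As in `merge_pos_into_con`, the tag position advances for every pre-terminal, so tags stay aligned with tokens even when a label is left unchanged.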

A pipeline can then chain the three components together:

In [18]:
nlp = hanlp.pipeline() \
    .append(pos, input_key='tok', output_key='pos') \
    .append(con, input_key='tok', output_key='con') \
    .append(merge_pos_into_con, input_key='*')

The pipeline has the following structure:

In [19]:
print(nlp)

[tok->TransformerTagger->pos, tok->CRFConstituencyParser->con, None->merge_pos_into_con->None]


Try it on a sentence that has already been tokenized:

In [20]:
doc = nlp(tok=["2021年", "HanLPv2.1", "带来", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"])
print(doc)

{
  "tok": [
    "2021年",
    "HanLPv2.1",
    "带来",
    "最",
    "先进",
    "的",
    "多",
    "语种",
    "NLP",
    "技术",
    "。"
  ],
  "pos": [
    "NT",
    "NR",
    "VV",
    "AD",
    "VA",
    "DEC",
    "CD",
    "NN",
    "NR",
    "NN",
    "PU"
  ],
  "con": [
    "TOP",
    [["IP", [["NP-TMP", [["NT", ["2021年"]]]], ["NP-PN-SBJ", [["NR", ["HanLPv2.1"]]]], ["VP", [["VV", ["带来"]], ["NP-OBJ", [["CP", [["CP", [["IP", [["VP", [["ADVP", [["AD", ["最"]]]], ["VP", [["VA", ["先进"]]]]]]]], ["DEC", ["的"]]]]]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]], ["PU", ["。"]]]]]
  ]
}


The pipeline's output is also a `Document`, so it supports visualization:

In [21]:
doc.pretty_print()

Token    	PoS    3       4       5       6       7       8         9            10
─────────	────────────────────────────────────────────────────────────────────────
2021年    	NT ─────────────────────────────────────────────────────►NP-TMP ────┐   
HanLPv2.1	NR ─────────────────────────────────────────────────────►NP-PN-SBJ──┤   
带来       	VV ────────────────────────────────────────────────────┐            │   
最        	AD ───►ADVP──┐                                         │            │   
先进       	VA ───►VP ───┴►VP ────►IP ───┐                         │            │   
的        	DEC──────────────────────────┴►CP ────►CP ───┐         ├►VP─────────┼►IP
多        	CD ───►QP ───┐                               │         │            │   
语种       	NN ───►NP ───┴────────────────────────►NP────┼►NP-OBJ──┘            │   
NLP      	NR ──┐                                       │                      │   
技术       	NN ──┴────────────────────────────────►NP ───┘                      │   
。   

To parse raw text, tokenization is the first step, so load a tokenizer first:

In [22]:
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)

Then insert the tokenizer as the first stage of the pipeline:

In [23]:
nlp.insert(0, tok, output_key='tok')

[None->TransformerTaggingTokenizer->tok,
 tok->TransformerTagger->pos,
 tok->CRFConstituencyParser->con,
 None->merge_pos_into_con->None]
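
The `input->component->output` notation printed above suggests a simple routing rule: each stage reads one field of the document (or the whole document when the input is `None`) and writes its result to one field (or replaces the document when the output is `None`). The sketch below models that rule with a toy class. It is an assumption-laden illustration, not HanLP's pipeline implementation: `ToyPipeline` and the `toy_*` stand-in stages are hypothetical, and real HanLP components behave differently in detail.

```python
from collections.abc import Mapping


class ToyPipeline:
    """Toy model of key-based stage routing; not HanLP's implementation."""

    def __init__(self):
        self.stages = []  # list of (function, input_key, output_key)

    def append(self, fn, input_key=None, output_key=None):
        self.stages.append((fn, input_key, output_key))
        return self  # returning self allows chained .append(...) calls

    def insert(self, index, fn, input_key=None, output_key=None):
        self.stages.insert(index, (fn, input_key, output_key))
        return self

    def __call__(self, doc=None, **fields):
        if doc is None:
            doc = dict(fields)  # keyword arguments become document fields
        for fn, input_key, output_key in self.stages:
            is_doc = isinstance(doc, Mapping)
            # With an input key the stage sees one field, otherwise everything.
            arg = doc[input_key] if (input_key and is_doc) else doc
            result = fn(arg)
            if output_key is None:
                doc = result  # the stage's return value replaces the document
            elif is_doc:
                doc[output_key] = result
            else:
                doc = {output_key: result}  # raw input becomes a document
        return doc


# Hypothetical stand-ins for a tokenizer, a tagger, and a merging function.
def toy_tok(text):
    return text.split()

def toy_pos(tokens):
    return ['NUM' if t.isdigit() else 'WORD' for t in tokens]

def toy_merge(doc):
    doc = dict(doc)
    doc['merged'] = list(zip(doc['tok'], doc['pos']))
    return doc


nlp_toy = ToyPipeline().append(toy_pos, input_key='tok', output_key='pos').append(toy_merge)
from_tokens = nlp_toy(tok=['HanLP', '2021'])

nlp_toy.insert(0, toy_tok, output_key='tok')  # the tokenizer becomes the first stage
from_text = nlp_toy('HanLP 2021')
print(from_text['merged'])
# [('HanLP', 'WORD'), ('2021', 'NUM')]
```

Once a tokenizing stage sits at the front, the toy pipeline accepts a raw string and produces the same document as it did from pre-tokenized input.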

Raw text can now be parsed directly:

In [24]:
print(nlp('2021年HanLPv2.1带来最先进的多语种NLP技术。')['con'])

(TOP
  (IP
    (NP-TMP (NT 2021年))
    (NP-PN-SBJ (NR HanLPv2.1))
    (VP
      (VV 带来)
      (NP-OBJ
        (CP (CP (IP (VP (ADVP (AD 最)) (VP (VA 先进)))) (DEC 的)))
        (NP (QP (CD 多)) (NP (NN 语种)))
        (NP (NR NLP) (NN 技术))))
    (PU 。)))
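
Bracketed output like the above is also easy to consume downstream. As a hedged sketch in plain Python (no HanLP required), the following reads a bracketed string back into nested lists and collects the text of every constituent with a given label. The helper names `parse_bracketed`, `leaves`, and `phrases` are hypothetical, and the parser assumes well-formed input whose tokens contain no whitespace or parentheses.

```python
import re


def parse_bracketed(text):
    """Parse a bracketed tree string into [label, children] nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a token (leaf)
                pos += 1
        pos += 1  # skip the closing bracket
        return [label, children]

    return read()


def leaves(node):
    """Return the tokens under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def phrases(node, label):
    """Yield the text of every constituent whose label equals `label` exactly."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield "".join(leaves(node))
    for child in node[1]:
        yield from phrases(child, label)


bracketed = """
(TOP
  (IP
    (NP-TMP (NT 2021年))
    (NP-PN-SBJ (NR HanLPv2.1))
    (VP
      (VV 带来)
      (NP-OBJ
        (CP (CP (IP (VP (ADVP (AD 最)) (VP (VA 先进)))) (DEC 的)))
        (NP (QP (CD 多)) (NP (NN 语种)))
        (NP (NR NLP) (NN 技术))))
    (PU 。)))
"""

tree = parse_bracketed(bracketed)
np_phrases = list(phrases(tree, "NP"))
print(np_phrases)
# ['多语种', '语种', 'NLP技术']
```

The match is on the exact label, so function-tagged constituents such as `NP-OBJ` or `NP-TMP` are not returned; matching on a prefix instead would be a small change to the comparison in `phrases`.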


Do you see? HanLP is designed for smart people: as long as you are smart enough, you can implement all kinds of functionality elegantly.