#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020

# Spotlight By Xin Hu - HanLP  

## Introduction
HanLP is a multilingual NLP library developed by Han He. It is built on TensorFlow 2.0, relies mainly on RNN-based networks, and aims to bring state-of-the-art deep learning techniques to both academia and industry. HanLP was designed from day one to be efficient, user-friendly, and extensible. It comes with pretrained models for various human languages, including English, Chinese, and many others.

HanLP provides several features for NLP tasks, including:

- [Tokenization](#Tokenization)

- [Part-of-Speech Tagging](#Part)

- [Named Entity Recognition](#name)

- [Syntactic Dependency Parsing](#syn)

- [Semantic Dependency Parsing](#sem)

HanLP also provides a pipeline that combines all of these features. This spotlight explores each feature and then shows how to train a model.

## Installation
HanLP is easy to install. Just run the following command:

```bash
pip install HanLP
```

HanLP requires Python 3.6 or later. GPU/TPU is suggested but not mandatory.

## Getting Started
<a id = 'Tokenization'></a>
### Tokenization
For a typical user, the basic workflow starts with loading a pretrained model and applying it to the task at hand. HanLP ships pretrained models that can be used directly. Taking the Chinese tokenizer as an example, we load it with the identifier **CTB6_CONVSEG**:

```python
import hanlp

# Load a pretrained model by its identifier
Ctokenizer = hanlp.load('CTB6_CONVSEG')
```


HanLP automatically resolves the identifier CTB6_CONVSEG to a URL, then downloads and unzips the model.

Once the model is loaded, we can tokenize one or more sentences by calling the tokenizer as a function:

```python
# Try the loaded model on a single sentence
print(Ctokenizer("周瑜打黄盖"))
```

['周瑜', '打', '黄盖']


```python
# A list of sentences is tokenized in one call
print(Ctokenizer(["我想过过过儿过过的生活","香港繁體也可以進行分割"]))
```

[['我', '想', '过', '过', '过儿', '过', '过', '的', '生活'], ['香港', '繁體', '也', '可以', '進行', '分割']]


#### Observation

The results above show that the tokenizer handles both Simplified and Traditional Chinese. This helps when processing Chinese articles from China (mainland), Hong Kong (China) and Taiwan (China).

**To tokenize English sentences, we switch to HanLP's rule-based English tokenizer:**

```python
# Switch the sentence analysis rules to the English rules
Etokenizer = hanlp.utils.rules.tokenize_english
print(Etokenizer("ZhouYu hits HuangGai"))
```

['ZhouYu', 'hits', 'HuangGai']
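To give a feel for what a rule-based tokenizer does, here is a minimal regex sketch. It is not HanLP's implementation: the function name `simple_english_tokenize` and the splitting rule (runs of word characters, or single punctuation marks) are assumptions chosen for illustration.

```python
import re


def simple_english_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_english_tokenize("ZhouYu hits HuangGai"))
print(simple_english_tokenize("Hello, world!"))
```

Real rule sets usually handle contractions, abbreviations and numbers as well, which this sketch ignores.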


To see which pretrained models HanLP provides, we can run the following code:

```python
print(hanlp.pretrained.ALL)
```

{'SIGHAN2005_PKU_CONVSEG': 'https://file.hankcs.com/hanlp/cws/sighan2005-pku-convseg_20200110_153722.zip', 'SIGHAN2005_MSR_CONVSEG': 'https://file.hankcs.com/hanlp/cws/convseg-msr-nocrf-noembed_20200110_153524.zip', 'CTB6_CONVSEG': 'https://file.hankcs.com/hanlp/cws/ctb6_convseg_nowe_nocrf_20200110_004046.zip', 'PKU_NAME_MERGED_SIX_MONTHS_CONVSEG': 'https://file.hankcs.com/hanlp/cws/pku98_6m_conv_ngram_20200110_134736.zip', 'CTB5_BIAFFINE_DEP_ZH': 'https://file.hankcs.com/hanlp/dep/biaffine_ctb5_20191229_025833.zip', 'CTB7_BIAFFINE_DEP_ZH': 'https://file.hankcs.com/hanlp/dep/biaffine_ctb7_20200109_022431.zip', 'PTB_BIAFFINE_DEP_EN': 'https://file.hankcs.com/hanlp/dep/ptb_dep_biaffine_20200101_174624.zip', 'SEMEVAL16_NEWS_BIAFFINE_ZH': 'https://file.hankcs.com/hanlp/sdp/semeval16-news-biaffine_20191231_235407.zip', 'SEMEVAL16_TEXT_BIAFFINE_ZH': 'https://file.hankcs.com/hanlp/sdp/semeval16-text-biaffine_20200101_002257.zip', 'SEMEVAL15_PAS_BIAFFINE_EN': 'https://file.hankcs.com/hanlp/sdp

#### Observation

From the results above, we can see that HanLP provides several kinds of pretrained models, including **classifier**, **cws**, **dep**, **glove**, **ner**, **pos**, **rnnlm**, **sdp** and **word2vec**. When working on a specific project, this list helps us quickly find the model we need.
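The printed dictionary is hard to scan, so one option is to group the identifiers by category. The sketch below uses a few entries copied from the output above and assumes the category is the path segment right after `/hanlp/` in each URL. The helper name `group_by_category` is hypothetical, and entries hosted on other sites would need different handling.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A few identifier -> URL entries copied from the printed dictionary above
models = {
    'SIGHAN2005_PKU_CONVSEG': 'https://file.hankcs.com/hanlp/cws/sighan2005-pku-convseg_20200110_153722.zip',
    'CTB6_CONVSEG': 'https://file.hankcs.com/hanlp/cws/ctb6_convseg_nowe_nocrf_20200110_004046.zip',
    'CTB5_BIAFFINE_DEP_ZH': 'https://file.hankcs.com/hanlp/dep/biaffine_ctb5_20191229_025833.zip',
    'PTB_BIAFFINE_DEP_EN': 'https://file.hankcs.com/hanlp/dep/ptb_dep_biaffine_20200101_174624.zip',
    'SEMEVAL16_NEWS_BIAFFINE_ZH': 'https://file.hankcs.com/hanlp/sdp/semeval16-news-biaffine_20191231_235407.zip',
}


def group_by_category(identifier_to_url):
    """Group identifiers by the URL path segment after /hanlp/ (an assumption)."""
    groups = defaultdict(list)
    for identifier, url in identifier_to_url.items():
        # The path looks like /hanlp/<category>/<file>.zip
        category = urlparse(url).path.split('/')[2]
        groups[category].append(identifier)
    return dict(groups)


by_category = group_by_category(models)
for category, identifiers in by_category.items():
    print(category, identifiers)
```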

<a id='Part'></a>
### Part-of-Speech Tagging
Taggers take a list of tokens as input and output a tag for each token.

```python
tokens = Ctokenizer("我的希望是希望和平.")
print(tokens)
# Load the Chinese part-of-speech tagger and tag the tokens
taggers = hanlp.load(hanlp.pretrained.pos.CTB5_POS_RNN_FASTTEXT_ZH)
print(taggers(tokens))
```

['我', '的', '希望', '是', '希望', '和平', '.']

#### Observation

"希望" means "hope" in English. The first "希望" is a noun ("my dream"), while the second is a verb meaning "want". HanLP distinguishes the two senses and assigns each occurrence a different tag.
If we want to tag English words, we can load `hanlp.pretrained.pos.PTB_POS_RNN_FASTTEXT_EN`.
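To make the observation concrete, the sketch below pairs each token with its tag and picks out the two occurrences of "希望". The tag list is hard-coded from the notebook output printed in the dependency parsing section of this spotlight rather than produced by running the model here.

```python
# Tokens and tags copied from the notebook output for this sentence
tokens = ['我', '的', '希望', '是', '希望', '和平', '.']
tags = ['PN', 'DEG', 'NN', 'VC', 'VV', 'NN', 'PU']

# Pair every token with its tag, then keep only the two "希望" tokens
tagged = list(zip(tokens, tags))
hope_tags = [tag for token, tag in tagged if token == '希望']

print(tagged)
print(hope_tags)  # ['NN', 'VV']: a noun first, then a verb
```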

<a id = 'name'></a>
### Named Entity Recognition
HanLP's NER uses different entity labels for different languages. For example, "Obama" in English is recognized as "PER", while "奥巴马" in Chinese is tagged "NR". NER also takes a list of tokens as input and outputs one tuple per recognized entity. Each tuple has four elements: (entity, type, begin, end). The begin and end offsets are useful when we need the position of each entity. I will take Chinese as an example:

```python
# Load the model
recognizer = hanlp.load(hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH)
# Split the sentence into a list of characters before recognition
print(recognizer(list("张宇爬上了长城， 喝河南胡辣汤， 在海淀宾馆睡了一觉")))
```


#### Observation

MSRA_NER_BERT_BASE_ZH is a state-of-the-art NER model based on BERT. While it recognizes some entities, it still misses others, so NER in HanLP still has room for improvement.
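The (entity, type, begin, end) tuples can be consumed as in the sketch below. The entity tuples here are hand-written for illustration and are not real model output. The sketch also assumes that `begin` and `end` index into the input token list with an exclusive end, and `entity_text` is a hypothetical helper.

```python
# Character-level tokens, built the same way as the recognizer input
chars = list("张宇爬上了长城")

# Hypothetical, hand-written entities (NOT real model output).
# Assumption: begin/end index into the token list, with end exclusive.
entities = [("张宇", "NR", 0, 2), ("长城", "NS", 5, 7)]


def entity_text(tokens, begin, end):
    """Rebuild an entity's surface form from its token span."""
    return "".join(tokens[begin:end])


for entity, label, begin, end in entities:
    # The span should reproduce the entity string under the assumption above
    assert entity_text(chars, begin, end) == entity
    print(f"{entity}\t{label}\t[{begin}, {end})")
```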

<a id = 'syn'></a>
### Syntactic Dependency Parsing

Parsing lies at the core of web search and recommendation. Without parsing, it is hard to determine the relevance between a query and a document. In the earlier homework, we used NLTK and stemming to process documents. With HanLP, once tokens and tags are available, parsing takes no more than two lines of code. Take a Chinese sentence as an example.

```python
# Reuse the tokens from above and pair each token with its tag
tag = taggers(tokens)
print(tokens, tag)
parse_input = list(zip(tokens, tag))
print(parse_input)
syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB7_BIAFFINE_DEP_ZH)
print(syntactic_parser(parse_input))
```

['我', '的', '希望', '是', '希望', '和平', '.'] ['PN', 'DEG', 'NN', 'VC', 'VV', 'NN', 'PU']
[('我', 'PN'), ('的', 'DEG'), ('希望', 'NN'), ('是', 'VC'), ('希望', 'VV'), ('和平', 'NN'), ('.', 'PU')]

Syntactic dependency parsing in HanLP follows the ICLR 2017 paper by Dozat and Manning (Stanford), which reported the highest scores among graph-based methods at the time.

#### Observation

Parsers take both tokens and part-of-speech tags as input. The output is a tree in CoNLL-X format, which can be manipulated through the CoNLLSentence class.

By reusing the tokens and tags from before, we can parse any sentence with a few lines of code, which saves a lot of time.

For English, just load `hanlp.pretrained.dep.PTB_BIAFFINE_DEP_EN` instead.

<a id='sem'></a>
### Semantic Dependency Parsing
A graph is a generalization of a tree and conveys more information about the semantic relations between tokens.

HanLP implements the biaffine model, which delivers state-of-the-art performance.

We reuse the same example as above.

```python
semantic_parser = hanlp.load(hanlp.pretrained.sdp.SEMEVAL16_NEWS_BIAFFINE_ZH)
print(semantic_parser(parse_input))
```


The semantic parser's output has the same structure as the syntactic parser's. However, the semantic output is a graph rather than a tree, which means a node may have multiple heads, as the following example shows:

```python
# Take a special example
Sinput = [('柴火', 'NN'), ('两', 'CD'), ('头', 'NN'), ('烧', 'VV')]
print(semantic_parser(Sinput))
```

```
1	柴火	_	NN	_	_	3	Poss	_	_
1	柴火	_	NN	_	_	4	Pat	_	_
2	两	_	CD	_	_	3	Quan	_	_
3	头	_	NN	_	_	4	Loc	_	_
4	烧	_	VV	_	_	0	Root	_	_
```
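The rows above can also be processed programmatically. The sketch below is plain Python and does not use HanLP's CoNLLSentence class. It parses the tab-separated rows copied from this output, reading the first column as the token ID, the second as the word, the seventh as the head ID and the eighth as the relation, and then lists the tokens that have more than one head.

```python
from collections import defaultdict

# Rows copied from the semantic parser output above (tab-separated columns)
rows = [
    "1\t柴火\t_\tNN\t_\t_\t3\tPoss\t_\t_",
    "1\t柴火\t_\tNN\t_\t_\t4\tPat\t_\t_",
    "2\t两\t_\tCD\t_\t_\t3\tQuan\t_\t_",
    "3\t头\t_\tNN\t_\t_\t4\tLoc\t_\t_",
    "4\t烧\t_\tVV\t_\t_\t0\tRoot\t_\t_",
]

forms = {}                 # token id -> word form
heads = defaultdict(list)  # token id -> list of (head id, relation)
for row in rows:
    cols = row.split("\t")
    token_id, form = int(cols[0]), cols[1]
    head_id, relation = int(cols[6]), cols[7]
    forms[token_id] = form
    heads[token_id].append((head_id, relation))

# In a tree every token has exactly one head; a graph allows several
multi_head = {forms[i]: arcs for i, arcs in heads.items() if len(arcs) > 1}
print(multi_head)  # {'柴火': [(3, 'Poss'), (4, 'Pat')]}
```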


#### Observation

"柴火" has two heads (ID 3 and ID 4). Researchers can choose between syntactic and semantic parsing according to their needs, which I think is a real benefit of HanLP.

### Pipeline
Since parsers require part-of-speech tags and tokens, and taggers expect tokenization to be done beforehand, it would be convenient to have a pipeline that chains all of these steps. Luckily, HanLP provides such a pipeline, which saves us valuable time.

```python
pipeline = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(Ctokenizer, output_key='tokens') \
    .append(taggers, output_key='part_of_speech_tags') \
    .append(syntactic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='syntactic_dependencies') \
    .append(semantic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='semantic_dependencies')
```

Notice that the first pipe is an old-school Python function, split_sentence, which splits the input text into a list of sentences. The later deep learning components can then take advantage of batch processing seamlessly. The result is a pipeline with one input (text) pipe, multiple flow pipes and one output (a parsed document).
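The way `input_key` and `output_key` route data between pipes can be illustrated with a toy re-implementation. This is only a sketch of the idea, not HanLP's code: `ToyPipeline` and the three small functions are hypothetical, and the sketch assumes that a pipe without an `input_key` reads the previous pipe's output.

```python
class ToyPipeline:
    """Toy keyed pipeline for illustration; not HanLP's implementation."""

    def __init__(self):
        self.steps = []  # each step is (function, input_key, output_key)

    def append(self, fn, input_key=None, output_key=None):
        self.steps.append((fn, input_key, output_key))
        return self  # returning self allows chained .append() calls

    def __call__(self, text):
        doc = {}
        previous_key = None
        for fn, input_key, output_key in self.steps:
            key = input_key if input_key is not None else previous_key
            if key is None:
                result = fn(text)  # the first pipe sees the raw text
            elif isinstance(key, tuple):
                result = fn(*(doc[k] for k in key))  # several inputs
            else:
                result = fn(doc[key])
            doc[output_key] = result
            previous_key = output_key
        return doc


def split_sentences(text):
    return [s for s in text.split("。") if s]


def char_tokenize(sentences):
    return [list(s) for s in sentences]


def count_tokens(sentences, tokens):
    return [(s, len(t)) for s, t in zip(sentences, tokens)]


toy = (ToyPipeline()
       .append(split_sentences, output_key="sentences")
       .append(char_tokenize, output_key="tokens")
       .append(count_tokens, input_key=("sentences", "tokens"), output_key="lengths"))
result = toy("周瑜打黄盖。黄盖打周瑜。")
print(result)
```

Each pipe writes its result under its own key, so later pipes can pick any combination of earlier outputs, and the final dict keeps every intermediate result.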

We can print the pipeline to check its structure.

```python
# Show the pipeline structure
print(pipeline)
```

[None->LambdaComponent->sentences, sentences->NgramConvTokenizer->tokens, tokens->RNNPartOfSpeechTagger->part_of_speech_tags, ('tokens', 'part_of_speech_tags')->BiaffineDependencyParser->syntactic_dependencies, ('tokens', 'part_of_speech_tags')->BiaffineSemanticDependencyParser->semantic_dependencies]


Apply the pipeline to a real text.

```python
text = "胡辣汤由多种天然中草药按比例配制的汤料再加入胡椒和辣椒又用骨头汤做底料的胡辣汤，其特点是汤味浓郁、汤色靓丽、汤汁粘稠，香辣可口，十分适合配合其它早点进餐。目前，已经发展成为河南及陕西等周边省份都喜爱和知晓的小吃之一。是中国河南的特色汤类食品，被大家所喜爱，常作为早餐，其特点是麻辣鲜香，营养开胃，适合搭配油条、包子、葱油饼、锅盔，千层饼等面点。"
print(pipeline(text))
```



#### Note

The pipeline's output is a JSON dict. At any time, we can add extra pre- or post-processing steps to the pipeline, such as text cleaning or a custom dictionary. After that, we can run pipeline.save("mypipeline.json") to save it. This is similar to a deep learning network: you can define its structure, save it, and later load and run it whenever needed. That is one of the most important reasons I recommend HanLP for processing Chinese articles.
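As a small illustration of what a JSON dict means in practice, the sketch below serializes a hand-built dict shaped like part of a pipeline result. The keys match the `output_key` names used above and the values are copied from the earlier tokenizer and tagger outputs. It is not an actual pipeline run.

```python
import json

# Hand-built stand-in for part of a pipeline result (not a real run)
doc = {
    "sentences": ["我的希望是希望和平."],
    "tokens": [['我', '的', '希望', '是', '希望', '和平', '.']],
    "part_of_speech_tags": [['PN', 'DEG', 'NN', 'VC', 'VV', 'NN', 'PU']],
}

# ensure_ascii=False keeps the Chinese characters readable in the output
serialized = json.dumps(doc, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(serialized)
```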

### Train own model
HanLP also provides functions to train our own models. If you have your own dataset, you can train a model on it and then use it for prediction. The following is sample code:

```python
from hanlp.components.tok import NgramConvTokenizer
from hanlp.datasets.cws.sighan2005.msr import SIGHAN2005_MSR_TRAIN, SIGHAN2005_MSR_VALID, SIGHAN2005_MSR_TEST
from hanlp.pretrained.word2vec import CONVSEG_W2V_NEWS_TENSITE_CHAR
import tensorflow as tf

tokenizer = NgramConvTokenizer()  # create the network
save_dir = './'
# Train the model
tokenizer.fit(SIGHAN2005_MSR_TRAIN,
              SIGHAN2005_MSR_VALID,
              save_dir,
              word_embed={'class_name': 'HanLP>Word2VecEmbedding',
                          'config': {
                              'trainable': True,
                              'filepath': CONVSEG_W2V_NEWS_TENSITE_CHAR,
                              'expand_vocab': False,
                              'lowercase': False,
                          }},
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                 epsilon=1e-8, clipnorm=5),
              epochs=5,
              window_size=0,
              metrics='f1',
              weight_norm=True)
# Evaluate on the test set
tokenizer.evaluate(SIGHAN2005_MSR_TEST, save_dir=save_dir)
```

#### Observation

As the code above shows, HanLP provides datasets for download, so if you do not have a suitable dataset, you can use one of them. Also, HanLP is built on TensorFlow 2.0, which can occasionally cause unexpected problems. Luckily, the HanLP developer keeps fixing bugs month after month, so I believe it will become more complete in the future.
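The training call uses `metrics='f1'`. For word segmentation, F1 is commonly computed over word spans: a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below shows that standard span-based computation. Whether HanLP computes it in exactly this way is an assumption, and the function names are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_words, predicted_words):
    """Span-based F1 between a gold and a predicted segmentation."""
    gold, predicted = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ['周瑜', '打', '黄盖']
predicted = ['周', '瑜', '打', '黄盖']  # the name is wrongly split in two
score = segmentation_f1(gold, predicted)
print(round(score, 4))  # 0.5714
```

Here two of the four predicted words match the gold spans, so precision is 2/4, recall is 2/3, and F1 is 4/7.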

## Discussion

HanLP is an up-to-date Chinese parsing toolkit. The current version is above 2.0, and it was first developed in Java. It has been maintained for a long time and regularly adopts state-of-the-art techniques, which is why I recommend it. We cannot say it is the best, but it could become very good in time. Compared with jieba, another popular Chinese parsing tool, HanLP uses deep learning to parse and is quicker than jieba.
In summary, I think HanLP has these good features:

1. It is really easy to install and deploy, so you can start using it quickly.

2. It offers enough functions. Since it ships many pretrained models, you will most likely find the one you need, and the other functions help you process articles.

3. It is up to date. The developer keeps up with well-known NLP deep learning papers. What's more, there is a forum for asking questions, reporting bugs, and communicating.

## Reference

Han He, HanLP: Han Language Processing, https://github.com/hankcs/HanLP