#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Spotlight: pyLTP

### Submitted by: Hanyang Li

### Due: March 26, 2020

Language Technology Platform (LTP) is a Chinese language processing library developed by HIT-SCIR, and pyLTP is its Python wrapper. LTP provides a set of efficient Chinese language processing modules.

With LTP, you can process Chinese text in stages. First, split the text into sentences. Then split each sentence into words; this differs from splitting words in English because there are no spaces between Chinese words. After that, you can determine the part-of-speech (POS) tag of each word. With the POS tags, you can run further analyses such as named entity recognition, dependency parsing, and semantic role labeling.

### Installation

#### Supported platforms and Python versions

|       |Python 2.6   |Python 2.7   |Python 3.4   |Python 3.5|Python 3.6|Conda Python |
|:-----:|:-----------:|:-----------:|:-----------:|:--------:|:--------:|:-----------:|
|Linux  |Supported    |Supported    |Supported    |Supported |Supported |Not supported|
|Mac OS |Supported    |Supported    |Supported    |Supported |Supported |Not supported|
|Windows|Not supported|Not supported|Not supported|Supported |Supported |Not supported|

#### Setup

Install pyLTP using pip by running in your terminal:

```
$ pip install pyltp
```

#### Downloading the model files

pyLTP relies on pretrained models, one for each language processing module. You can download the models from http://ltp.ai/download.html. This note uses model version 3.4.0.

### To Begin

Before processing Chinese text with pyLTP, we need to set the path of the model directory downloaded in the previous step.

In [1]:
import os
LTP_DATA_DIR='./ltp_data_v3.4.0'
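
Before loading anything, it can help to confirm that the model files used later in this note are actually present. The sketch below is only a convenience check written for this note, not part of pyLTP; the helper name `missing_models` is hypothetical, and the file names are the four models loaded in the sections that follow.

```python
import os

# Model files loaded later in this note (file names as used in the cells below).
REQUIRED_MODELS = ['cws.model', 'pos.model', 'parser.model', 'pisrl.model']

def missing_models(data_dir, required=REQUIRED_MODELS):
    """Return the required model file names that are not found in data_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = missing_models('./ltp_data_v3.4.0')
if missing:
    print("Missing model files: " + ", ".join(missing))
else:
    print("All model files found.")
```

An empty list means every model referenced in this note can be loaded from the directory.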

### Splitting Sentences

pyLTP can split text into sentences based on punctuation, much as a sentence splitter for English would. LTP ends a sentence whenever it detects one of the marks `。`, `！`, `？`, `……`, or `；`.

Take a paragraph from the Chinese Wikipedia article on the film Joker (https://zh.wikipedia.org/wiki/小丑_(電影)) as an example:

In [2]:
from pyltp import SentenceSplitter
sentences = SentenceSplitter.split('1981年，哥谭市充满失业、犯罪，导致许多人穷困潦倒、失去基本权利。身为社会局外人的亚瑟·佛莱克立志成为一位单口喜剧演员，做起派对小丑来供养他的年迈母亲潘妮……亚瑟本身患有一种罕见精神疾病，导致他在不合时宜的时候会大笑，只能接受社福机构人员的治疗以获取药物；一次工作时，亚瑟被一群小孩偷走招牌，被他们引入小巷里暴打，他的同事蓝道送他一把自保用的左轮手枪！回家时，亚瑟开始对他的邻居苏菲·杜蒙感兴趣，邀请她前来观看他的单口喜剧表演，随即顺利跟她开始交往？')
for i in range(len(sentences)):
    print("Sentence " + str(i) + ": " + sentences[i] + "\n")

Sentence 0: 1981年，哥谭市充满失业、犯罪，导致许多人穷困潦倒、失去基本权利。

Sentence 1: 身为社会局外人的亚瑟·佛莱克立志成为一位单口喜剧演员，做起派对小丑来供养他的年迈母亲潘妮……

Sentence 2: 亚瑟本身患有一种罕见精神疾病，导致他在不合时宜的时候会大笑，只能接受社福机构人员的治疗以获取药物；

Sentence 3: 一次工作时，亚瑟被一群小孩偷走招牌，被他们引入小巷里暴打，他的同事蓝道送他一把自保用的左轮手枪！

Sentence 4: 回家时，亚瑟开始对他的邻居苏菲·杜蒙感兴趣，邀请她前来观看他的单口喜剧表演，随即顺利跟她开始交往？



As shown above, the text is split into sentences at the punctuation marks listed earlier; other punctuation marks, such as `，` and `、`, are ignored by the sentence splitter.
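
To make the splitting rule concrete, here is a rough pure-Python approximation. It is a sketch under the assumption that a sentence ends at any run of the five marks listed above; it is not how LTP implements `SentenceSplitter`, and the function name `split_sentences` and the demo text are made up for illustration.

```python
import re

# Sentence-ending marks listed above. '……' is two '…' characters, so the
# character class matches it as a run of single '…' characters.
_SENTENCE_END = re.compile(r'[^。！？；…]*[。！？；…]+')

def split_sentences(text):
    """Split text after each run of sentence-ending punctuation."""
    sentences = [m.group().strip() for m in _SENTENCE_END.finditer(text)]
    # Keep any trailing text that has no closing punctuation.
    tail = _SENTENCE_END.sub('', text).strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]

demo = '今天下雨了。你带伞了吗？没有……那就等一等；雨很快会停！'
for i, s in enumerate(split_sentences(demo)):
    print("Sentence " + str(i) + ": " + s)
```

Commas (`，`) and enumeration commas (`、`) are deliberately left out of the pattern, so they never end a sentence.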

### Word Segmentation

After splitting the text into sentences, we can split each sentence into words. pyLTP segments words using a trained model.

Take `Sentence 0` as an example. To make it easier to follow, here is the sentence with its English translation, followed by its correct segmentation:

* Chinese: 1981年，哥谭市充满失业、犯罪，导致许多人穷困潦倒、失去基本权利。
* English: In 1981, Gotham is rife with crime and unemployment, leaving swathes of the population impoverished and deprived of basic rights.

The correct segments are:

|Chinese|English|
|:-:|:-:|
|1981年|In 1981|
|，|,|
|哥谭市|Gotham|
|充满|is rife with|
|失业|unemployment|
|、|and|
|犯罪|crime|
|，|,|
|导致|leaving|
|许多|swathes of|
|人|the population|
|穷困潦倒|impoverished|
|、|and|
|失去|lost of|
|基本|basic|
|权利|rights|
|。|.|

The words produced by pyLTP are:

In [3]:
from pyltp import Segmentor

cws_model_path=os.path.join(LTP_DATA_DIR,'cws.model')

segmentor=Segmentor()
segmentor.load(cws_model_path)

words=segmentor.segment('1981年，哥谭市充满失业、犯罪，导致许多人穷困潦倒、失去基本权利。')

for i in range(len(words)):
    print("Word " + str(i).rjust(2) + ": " + words[i])

segmentor.release()

Word  0: 1981年
Word  1: ，
Word  2: 哥
Word  3: 谭市
Word  4: 充满
Word  5: 失业
Word  6: 、
Word  7: 犯罪
Word  8: ，
Word  9: 导致
Word 10: 许多
Word 11: 人
Word 12: 穷困
Word 13: 潦倒
Word 14: 、
Word 15: 失去
Word 16: 基本
Word 17: 权利
Word 18: 。


As shown above, most of the words are split correctly; the exceptions are 哥谭市 (Gotham) and 穷困潦倒 (impoverished).

The pyLTP segmentation model is trained on general text, so it does not handle domain-specific words, such as the fictional place name 哥谭市, well.

穷困潦倒 is a four-character idiom composed of two common words, 穷困 and 潦倒, so pyLTP splits it in two.

To solve this problem, we can segment words with the help of an external lexicon. Once we add 哥谭市 and 穷困潦倒 to the lexicon and load it into the pyLTP segmentor, it no longer splits these two words apart in this sentence.
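
The contents of `lexicon.txt` are not shown in this note. A minimal sketch for creating it follows, assuming the external lexicon is a plain-text UTF-8 file with one word per line; the two entries are the words discussed above.

```python
# Words that the segmentor should keep intact, one per line.
lexicon_words = ['哥谭市', '穷困潦倒']

# Write the lexicon as UTF-8 so the Chinese characters are stored correctly.
with open('./lexicon.txt', 'w', encoding='utf-8') as f:
    for word in lexicon_words:
        f.write(word + '\n')
```

The path matches the one passed to `load_with_lexicon` in the next cell.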

In [4]:
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')

from pyltp import Segmentor

segmentor = Segmentor()
segmentor.load_with_lexicon(cws_model_path, './lexicon.txt')

words = segmentor.segment('1981年，哥谭市充满失业、犯罪，导致许多人穷困潦倒、失去基本权利。')

for i in range(len(words)):
    print("Word " + str(i).rjust(2) + ": " + words[i])

segmentor.release()

Word  0: 1981年
Word  1: ，
Word  2: 哥谭市
Word  3: 充满
Word  4: 失业
Word  5: 、
Word  6: 犯罪
Word  7: ，
Word  8: 导致
Word  9: 许多
Word 10: 人
Word 11: 穷困潦倒
Word 12: 、
Word 13: 失去
Word 14: 基本
Word 15: 权利
Word 16: 。
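
To put a number on the difference between the two runs, the sketch below scores the first segmentation against the lexicon-based one, treating the latter as the reference. Word-level precision and recall over character spans is a common way to evaluate segmenters, but it is an addition of this note rather than a pyLTP feature; both word lists are copied by hand from the outputs above, and the helper names are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def precision_recall(predicted, reference):
    """Score predicted words against reference words by exact span match."""
    pred_spans, ref_spans = to_spans(predicted), to_spans(reference)
    correct = len(pred_spans & ref_spans)
    return correct / len(pred_spans), correct / len(ref_spans)

# Output of the model-only segmentor (copied from In [3]).
model_only = ['1981年', '，', '哥', '谭市', '充满', '失业', '、', '犯罪', '，',
              '导致', '许多', '人', '穷困', '潦倒', '、', '失去', '基本', '权利', '。']
# Output of the lexicon-assisted segmentor (copied from In [4]), used as reference.
with_lexicon = ['1981年', '，', '哥谭市', '充满', '失业', '、', '犯罪', '，',
                '导致', '许多', '人', '穷困潦倒', '、', '失去', '基本', '权利', '。']

p, r = precision_recall(model_only, with_lexicon)
print("precision = %.3f, recall = %.3f" % (p, r))
```

Here 15 of the 19 predicted words and 15 of the 17 reference words match; the four mismatched predictions are the two halves of 哥谭市 and of 穷困潦倒.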


Now we have the correct word segments. Let's pair each word with its English gloss.

In [5]:
en_words = ['In 1981',',','Gotham','is rife with','unemployment','and','crime',',','leaving',
            'swathes of','the population','impoverished','and','lost of','basic','rights','.']
for i in range(len(words)):
    print("Word " + str(i).rjust(2) + ": " + words[i].ljust(7) + "\t(" + en_words[i] + ")")

Word  0: 1981年  	(In 1981)
Word  1: ，      	(,)
Word  2: 哥谭市    	(Gotham)
Word  3: 充满     	(is rife with)
Word  4: 失业     	(unemployment)
Word  5: 、      	(and)
Word  6: 犯罪     	(crime)
Word  7: ，      	(,)
Word  8: 导致     	(leaving)
Word  9: 许多     	(swathes of)
Word 10: 人      	(the population)
Word 11: 穷困潦倒   	(impoverished)
Word 12: 、      	(and)
Word 13: 失去     	(lost of)
Word 14: 基本     	(basic)
Word 15: 权利     	(rights)
Word 16: 。      	(.)


### Part-of-Speech Tagging

pyLTP can determine the part of speech (POS) of each word using a trained model and the word segments.

The POS tag set is shown below (some of the examples have no English equivalent):

|Tag|Description|Example|
|:-:|:-:|:-:|
|a|adjective|美丽 (beautiful)|
|b|other noun-modifier|大型 (large size)|
|c|conjunction|虽然 (although)|
|d|adverb|很 (very)|
|e|exclamation|嗯 (Emmm)|
|g|morpheme|茨|
|h|prefix|阿|
|i|idiom|百花齐放|
|j|abbreviation|公检法|
|k|suffix|率|
|m|number|第一 (the first)|
|n|general noun|苹果 (apple)|
|nd|direction noun|右侧 (right)|
|nh|person name|汤姆 (Tom)|
|ni|organization name|保险公司 (insurance company)|
|nl|location noun|城市 (city)|
|ns|geographical name|北京 (Beijing)|
|nt|temporal noun|近日 (recently)|
|nz|other proper noun|诺贝尔奖 (Nobel Prize)|
|o|onomatopoeia|当啷 (clank)|
|p|preposition|让 (let)|
|q|quantity|张 (piece)|
|r|pronoun|我们 (we)|
|u|auxiliary|的 ('s)|
|v|verb|学习 (study)|
|wp|punctuation|。|
|ws|foreign words|CPU|
|x|non-lexeme|萄|
|z|descriptive words|瑟瑟|

The correct POS tags of the Chinese words (which may differ from those of their English counterparts) are:

|Chinese|English|Tag|
|:-:|:-:|:-:|
|1981年|In 1981|nt|
|，|,|wp|
|哥谭市|Gotham|ns|
|充满|is rife with|v|
|失业|unemployment|v|
|、|and|wp|
|犯罪|crime|v|
|，|,|wp|
|导致|leaving|v|
|许多|swathes of|m|
|人|the population|n|
|穷困潦倒|impoverished|i|
|、|and|wp|
|失去|lost of|v|
|基本|basic|a|
|权利|rights|n|
|。|.|wp|

The POS tags given by pyLTP are:

In [6]:
pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')

from pyltp import Postagger

postagger = Postagger()
postagger.load(pos_model_path)

postags = postagger.postag(words)

for i in range(len(words)):
    print(words[i] + "\t:" + postags[i])

postagger.release()

1981年	:nt
，	:wp
哥谭市	:ns
充满	:v
失业	:v
、	:wp
犯罪	:v
，	:wp
导致	:v
许多	:m
人	:n
穷困潦倒	:i
、	:wp
失去	:v
基本	:a
权利	:n
。	:wp


As shown above, the tags produced by pyLTP match the reference tags for every word in this sentence.
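
A common next step in a retrieval pipeline is to use the tags to filter tokens, for example dropping punctuation before indexing. The sketch below does this on the tagged sentence. The word and tag lists are copied from the output above so the block runs without pyLTP, and the choice of which tags to drop is an assumption made for illustration.

```python
from collections import defaultdict

# Words and tags copied from the tagger output above.
demo_words = ['1981年', '，', '哥谭市', '充满', '失业', '、', '犯罪', '，', '导致',
              '许多', '人', '穷困潦倒', '、', '失去', '基本', '权利', '。']
demo_tags = ['nt', 'wp', 'ns', 'v', 'v', 'wp', 'v', 'wp', 'v',
             'm', 'n', 'i', 'wp', 'v', 'a', 'n', 'wp']

# Tags to drop: here only punctuation (wp).
DROP_TAGS = {'wp'}
kept = [(w, t) for w, t in zip(demo_words, demo_tags) if t not in DROP_TAGS]

# Group the remaining words by tag.
by_tag = defaultdict(list)
for w, t in kept:
    by_tag[t].append(w)

for tag in sorted(by_tag):
    print(tag.ljust(3) + ": " + " ".join(by_tag[tag]))
```

Extending `DROP_TAGS` (for example with `u` for auxiliaries) gives a simple tag-based stop-word filter.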

### Dependency Parsing

Once we have the word segments and the POS tags, we can use pyLTP to detect the syntactic dependencies, again using a trained model.

The dependency relation tags are shown below:

|Tag|Description|Example|
|:-:|:-:|:-:|
|SBV|subject-verb|I <-- give|
|VOB|verb-object|give --> apple|
|IOB|indirect-object|give --> you|
|FOB|fronting-object|book <-- read|
|DBL|double|buy --> me|
|ATT|attribute|red <-- apple|
|ADV|adverbial|very <-- beautiful|
|CMP|complement|finish --> all|
|COO|coordinate|sea --> ocean|
|POB|preposition-object|on --> top|
|LAD|left adjunct|and <-- ocean|
|RAD|right adjunct|child --> ren|
|IS|independent structure|independent sentences|
|HED|head|core of the sentence|
|WP|punctuation|punctuation mark attached to its head word|

The dependencies given by pyLTP are listed below, one word per line, in the form word, relation, head:

In [7]:
par_model_path = os.path.join(LTP_DATA_DIR, 'parser.model')

from pyltp import Parser

parser = Parser()
parser.load(par_model_path)

arcs = parser.parse(words, postags)

for i in range(len(words)):
    head = arcs[i].head  # 1-based position of the head word; 0 means the virtual ROOT node
    print((words[i] + "\t(" + en_words[i] + ")").ljust(22),end='')
    print("\t---" + arcs[i].relation.ljust(3) + "-->", end='\t')
    if head == 0:
        # Without this check, words[head-1] would wrap around to the last token.
        print("ROOT")
    else:
        print((words[head-1] + "\t(" + en_words[head-1] + ")").ljust(22))

parser.release()

1981年	(In 1981)       	---ADV-->	充满	(is rife with)     
，	(,)                 	---WP -->	1981年	(In 1981)       
哥谭市	(Gotham)          	---SBV-->	充满	(is rife with)     
充满	(is rife with)     	---HED-->	ROOT
失业	(unemployment)     	---VOB-->	充满	(is rife with)     
、	(and)               	---WP -->	犯罪	(crime)            
犯罪	(crime)            	---COO-->	失业	(unemployment)     
，	(,)                 	---WP -->	充满	(is rife with)     
导致	(leaving)          	---COO-->	充满	(is rife with)     
许多	(swathes of)       	---ATT-->	人	(the population)    
人	(the population)    	---SBV-->	穷困潦倒	(impoverished)   
穷困潦倒	(impoverished)   	---VOB-->	导致	(leaving)          
、	(and)               	---WP -->	失去	(lost of)          
失去	(lost of)          	---COO-->	穷困潦倒	(impoverished)   
基本	(basic)            	---ATT-->	权利	(rights)           
权利	(rights)           	---VOB-->	失去	(lost of)          
。	(.)                 	---WP -->	充满	(is rife with)     
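
The arcs can be turned into simple structures for downstream use. As a sketch, the block below collects subject-predicate-object triples by pairing each SBV child with each VOB child of the same head. The arcs are transcribed by hand from the output above as (head, relation) pairs, where head is the 1-based position of the head word and 0 marks the root; the triple-extraction rule is an illustrative assumption, not something pyLTP provides.

```python
from collections import defaultdict

dep_words = ['1981年', '，', '哥谭市', '充满', '失业', '、', '犯罪', '，', '导致',
             '许多', '人', '穷困潦倒', '、', '失去', '基本', '权利', '。']
# (head position, relation) for each word, transcribed from the parser output.
dep_arcs = [(4, 'ADV'), (1, 'WP'), (4, 'SBV'), (0, 'HED'), (4, 'VOB'),
            (7, 'WP'), (5, 'COO'), (4, 'WP'), (4, 'COO'), (11, 'ATT'),
            (12, 'SBV'), (9, 'VOB'), (14, 'WP'), (12, 'COO'), (16, 'ATT'),
            (14, 'VOB'), (4, 'WP')]

# Index children by (head position, relation).
children = defaultdict(list)
for position, (head, relation) in enumerate(dep_arcs, start=1):
    children[(head, relation)].append(position)

# Pair every subject (SBV) with every object (VOB) of the same head word.
triples = []
for head in range(1, len(dep_words) + 1):
    for subj in children[(head, 'SBV')]:
        for obj in children[(head, 'VOB')]:
            triples.append((dep_words[subj - 1], dep_words[head - 1], dep_words[obj - 1]))

root = dep_words[children[(0, 'HED')][0] - 1]
print("Root: " + root)
for triple in triples:
    print(triple)
```

This simple rule does not follow COO arcs, so the coordinated object 犯罪 (attached to 失业) is not included in any triple.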


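### Semantic Role Labeling

Based on a trained model, the word segments, the POS tags, and the syntactic dependencies, pyLTP can recognize the semantic roles in the sentence.

The core roles are labeled A0 to A5, where A0 usually marks the agent of the action and A1 the entity affected by it. The remaining role labels are shown in the table below:

|Semantic Role Label|Description|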
|:-:|:-:|
|ADV|adverbial, default tag|
|BNE|beneficiary|
|CND|condition|
|DIR|direction|
|DGR|degree|
|EXT|extent|
|FRQ|frequency|
|LOC|locative|
|MNR|manner|
|PRP|purpose or reason|
|TMP|temporal|
|TPC|topic|
|CRD|coordinated arguments|
|PRD|predicate|
|PSR|possessor|
|PSE|possessee|

The semantic roles detected by pyLTP in the Gotham sentence are:

In [8]:
srl_model_path = os.path.join(LTP_DATA_DIR, 'pisrl.model')

# Note: the class name is spelled "Sementic" in pyltp itself, so keep it as is.
from pyltp import SementicRoleLabeller

labeller = SementicRoleLabeller()
labeller.load(srl_model_path)

roles = labeller.label(words, postags, arcs)

for role in roles:
    # role.index is the 0-based position of the predicate in words.
    print("Role: " + words[role.index] + " (" + en_words[role.index] + ")")
    for arg in role.arguments:
        # arg.range.start and arg.range.end are 0-based and both inclusive.
        span = range(arg.range.start, arg.range.end + 1)
        zh = "".join(words[i] for i in span)
        en = " ".join(en_words[i] for i in span)
        print("\t" + (arg.name + ":").rjust(5) + " " + zh + " (" + en + ")")
    print()

labeller.release()

Role: 充满 (is rife with)
	 TMP: 1981年， (In 1981 ,)
	  A0: 哥谭市 (Gotham)
	  A1: 失业、犯罪 (unemployment and crime)

Role: 导致 (leaving)
	 TMP: 1981年， (In 1981 ,)
	  A0: 哥谭市 (Gotham)

Role: 穷困潦倒 (impoverished)
	  A0: 许多人 (swathes of the population)

Role: 失去 (lost of)
	  A0: 许多人 (swathes of the population)
	  A1: 基本权利 (basic rights)

