##### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020

# Spotlight: Using the Jieba Segmentation Package in Information Retrieval
#### submitted by Chen BingXu

## 1 Introduction:
"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.  
If you need to process Chinese text, you can use this package for tasks such as extracting keywords or retrieving the most relevant documents. It can also handle English text, although not as well.  
(This notebook contains some Chinese examples. If they are hard to read, you can use a translation tool to convert them to another language.)

## 2 Features:
- Supports three segmentation modes:  
1. Accurate Mode attempts to cut the sentence into the most accurate segmentation, which is suitable for text analysis.  
2. Full Mode gets all the possible words from the sentence. Fast but not accurate.  
3. Search Engine Mode, based on Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.  
- Supports Traditional Chinese
- Supports customized dictionaries
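
To make Search Engine Mode more concrete, here is a minimal pure-Python sketch of the idea: take the words produced by Accurate Mode and, for each long word, also emit the shorter dictionary words found inside it. The function `expand_for_search` and the toy dictionary are hypothetical and only illustrate the principle; they are not jieba's code.

```python
def expand_for_search(words, dictionary):
    """Yield the dictionary 2-grams and 3-grams inside each word, then the word itself."""
    for word in words:
        for n in (2, 3):
            if len(word) > n:
                for i in range(len(word) - n + 1):
                    gram = word[i:i + n]
                    if gram in dictionary:
                        yield gram
        yield word


# Hypothetical toy dictionary (an assumption for this sketch).
toy_dict = {"中国", "科学", "学院", "科学院", "中国科学院", "计算", "计算所"}

# Pretend this is the Accurate Mode result for "中国科学院计算所".
accurate = ["中国科学院", "计算所"]

expanded = list(expand_for_search(accurate, toy_dict))
print(", ".join(expanded))
```

Because the short words are indexed in addition to the long one, a query for 科学院 can still match a document that contains 中国科学院, which is why this mode raises recall.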

## 3 Algorithm:
- Uses a prefix dictionary structure for efficient word graph scanning, and builds a directed acyclic graph (DAG) of all possible word combinations.
- Uses dynamic programming to find the most probable combination based on word frequency.
- For unknown words, uses an HMM-based model with the Viterbi algorithm.
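
The first two steps can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the `FREQ` table is a made-up toy dictionary, the function names are hypothetical, and the HMM step for unknown words is left out.

```python
import math

# Hypothetical toy word frequencies (an assumption; the real dictionary is far larger).
FREQ = {"我": 50, "来到": 30, "北京": 40, "清华": 10, "清华大学": 20, "华大": 2, "大学": 35}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words that begin there."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # fall back to a single character
    return dag


def best_route(sentence, dag):
    """Dynamic programming from right to left over log probabilities."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - math.log(TOTAL)
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence):
    """Follow the best route through the DAG and collect the words."""
    dag = build_dag(sentence)
    route = best_route(sentence, dag)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


print("/".join(segment("我来到北京清华大学")))
```

With these toy frequencies, 清华大学 as one word scores higher than 清华 followed by 大学, so the dynamic programming step keeps the long word.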

## 4 Usage
- Fully automatic installation: `easy_install jieba` or `pip install jieba`
- Semi-automatic installation: download the package from http://pypi.python.org/pypi/jieba/ , extract it, and run `python setup.py install`.
- Manual installation: place the jieba directory in the current directory or in the Python site-packages directory.
- Then use `import jieba` in your code.

# Main Functions

## 1 Cut (Parsing)
- The `jieba.cut` function accepts three input parameters: the first is the string to be cut; the second is `cut_all`, which controls the cut mode; the third is `HMM`, which controls whether to use the Hidden Markov Model.
- `jieba.cut_for_search` accepts two parameters: the string to be cut and whether to use the Hidden Markov Model. It cuts the sentence into short words suitable for search engines.
- The input string can be a unicode/str object, or a str/bytes object encoded in UTF-8 or GBK. Note that GBK encoding is not recommended because it may be unexpectedly decoded as UTF-8.
- `jieba.cut` and `jieba.cut_for_search` return a generator, which you can iterate over with a `for` loop to get the segmentation result (in unicode).
- `jieba.lcut` and `jieba.lcut_for_search` return a list.
- `jieba.Tokenizer(dictionary=DEFAULT_DICT)` creates a new customized Tokenizer, which enables you to use different dictionaries at the same time. `jieba.dt` is the default Tokenizer, to which almost all global functions are mapped.

### Example (Chinese)

In [3]:
# encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # default (accurate) mode

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")  # search engine mode
print(", ".join(seg_list))

Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Default Mode: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ，, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造


### Example (English)

In [8]:
data=[
   "世界经济论坛也叫达沃斯论坛。",
   "The World Economic Forum is also called the Davos Forum."
    ]

for d in data:
    seg_list = jieba.cut(d)
    
    print(",".join(seg_list))

世界,经济,论坛,也,叫,达沃斯,论坛,。
The, ,World, ,Economic, ,Forum, ,is, ,also, ,called, ,the, ,Davos, ,Forum,.


## 2 Modify dictionary
- Use `add_word(word, freq=None, tag=None)` and `del_word(word)` to modify the dictionary dynamically in programs.

- Use `suggest_freq(segment, tune=True)` to adjust the frequency of a single word so that it can (or cannot) be segmented.

- Note that HMM may affect the final result.


### Example (Chinese and English): compare with the result of `Cut` above

In [26]:
jieba.add_word('世界经济论坛')
jieba.add_word('达沃斯论坛')
jieba.add_word('World Economic Forum')
jieba.add_word('Davos Forum')

data=[
   "世界经济论坛也叫达沃斯论坛。",
   "The World Economic Forum is also called the Davos Forum."
    ]

for d in data:
    seg_list = jieba.cut(d)
    
    print(",".join(seg_list))

世界经济论坛,也,叫,达沃斯论坛,。
The, ,World Economic Forum, ,is, ,also, ,called, ,the, ,Davos Forum,.


## 3 Keyword Extraction (based on TF-IDF)
`import jieba.analyse`

- `jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())`
    * sentence: the text to extract keywords from
    * topK: the number of keywords with the highest TF-IDF weights to return. The default value is 20
    * withWeight: whether to return the TF-IDF weights along with the keywords. The default value is False
    * allowPOS: only keep words whose POS tags are in this collection. Empty for no filtering.
- `jieba.analyse.TFIDF(idf_path=None)` creates a new TFIDF instance; `idf_path` specifies the path of the IDF file.
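
The weighting behind `extract_tags` can be sketched without jieba: a word scores high when it is frequent in this text (TF) but rare in general (IDF). The token list, the IDF values, and the function `tfidf_keywords` are all hypothetical; a real extractor would also filter stop words and use IDF values learned from a large corpus.

```python
from collections import Counter


def tfidf_keywords(tokens, idf, default_idf, top_k=3):
    """Rank words by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    scores = {word: (count / total) * idf.get(word, default_idf)
              for word, count in counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical, already segmented text and made-up IDF values.
tokens = ["吉林", "欧亚", "置业", "增资", "吉林", "欧亚", "置业", "公司", "公司", "公司"]
idf = {"吉林": 6.0, "欧亚": 8.0, "置业": 7.0, "增资": 7.5, "公司": 2.0}

top = tfidf_keywords(tokens, idf, default_idf=5.0)
for word, weight in top:
    print(word, ":", weight)
```

Note that 公司 is the most frequent token here but does not make the top three, because its low IDF marks it as a common word.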

In [27]:
#Keyword Extraction

import jieba.analyse

kWords = jieba.analyse.extract_tags("此外，公司拟对全资子公司吉林欧亚置业有限公司增资4.3亿元，增资后，吉林欧亚置业注册资本由7000万元增加到5亿元。>吉林欧亚置业主要经营范围为房地产开发及百货零售等业务。目前在建吉林欧亚城>市商业综合体项目。2013年，实现营业收入0万元，实现净利润-139.13万元。",topK=5,withWeight=True)

for word, weight in kWords:
    print(word, ":", weight)

欧亚 : 0.7458841454643479
吉林 : 0.6733651014252174
置业 : 0.49933765769413047
万元 : 0.3466477318421739
增资 : 0.3431245420230435


### Keyword Extraction (based on TextRank), compared with TF-IDF

In [35]:
import jieba.analyse

content = u'会议邀请到美国密歇根大学(University of Michigan, Ann Arbor）环境健康科学系副教授奚传武博士作题为“Multibarrier approach for safe drinking waterin the US : Why it failed in Flint”的学术讲座，介绍美国密歇根Flint市饮用水污染事故的发生发展和处置等方面内容。讲座后各相关单位同志与奚传武教授就生活饮用水在线监测系统、美国水污染事件的处置方式、生活饮用水老旧管网改造、如何有效减少消毒副产物以及美国涉水产品和二次供水单位的监管模式等问题进行了探讨和交流。本次交流会是我市生活饮用水卫生管理工作洽商机制运行以来的又一次新尝试，也为我市卫生计生综合监督部门探索生活饮用水卫生安全管理模式及突发水污染事件的应对措施开拓了眼界和思路。'

# based on TF-IDF
keywords = jieba.analyse.extract_tags(content,topK = 5,withWeight = True,allowPOS = ('n','nr','ns'))
for item in keywords:
    print(item[0],item[1])

饮用水 1.062341659448913
奚传武 0.5197725001260869
讲座 0.3842644616934783
美国 0.27108236293434784
科学系 0.2672008639021739


In [37]:
# based on TextRank
keywords = jieba.analyse.textrank(content,topK = 5,withWeight = True,allowPOS = ('n','nr','ns'))
for item in keywords:
    print(item[0],item[1])

饮用水 1.0
美国 0.5705647859730464
奚传武 0.5107384245087675
单位 0.4728418893343898
讲座 0.4437707320528082
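
For comparison with TF-IDF, here is a small pure-Python sketch of the TextRank idea: words that co-occur within a sliding window are linked in a graph, and a PageRank-style iteration ranks them. The function name, the parameters, and the token list are assumptions for illustration, and the scores are simply divided by the maximum so that the top word gets 1.0; this is not jieba's exact implementation.

```python
from collections import defaultdict


def textrank(tokens, window=5, damping=0.85, iterations=10):
    """Rank words on an undirected, weighted co-occurrence graph."""
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        # Link each word to the words that follow it inside the window.
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            rank = sum(weight / sum(weights[nb].values()) * scores[nb]
                       for nb, weight in weights[word].items())
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores

    top = max(scores.values())
    return sorted(((word, score / top) for word, score in scores.items()),
                  key=lambda kv: kv[1], reverse=True)


# Hypothetical, already segmented and POS-filtered tokens.
tokens = ["饮用水", "美国", "饮用水", "讲座", "饮用水", "单位", "饮用水"]
ranked = textrank(tokens, window=2)
for word, score in ranked:
    print(word, score)
```

Unlike TF-IDF, this ranking needs no IDF table: a word is important when it is connected to many other words in the same text.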


## 4 Part of Speech Tagging
- `jieba.posseg.POSTokenizer(tokenizer=None)` creates a new customized POSTokenizer. `tokenizer` specifies the `jieba.Tokenizer` to use internally. `jieba.posseg.dt` is the default POSTokenizer.
- Tags the POS of each word after segmentation, using labels compatible with ICTCLAS.

In [38]:
import jieba.posseg as pseg
words = pseg.cut("我爱北京天安门")
for word, flag in words:
    print('%s %s' % (word, flag))

我 r
爱 v
北京 ns
天安门 ns


For a purely English sequence, jieba cannot tag the words. For a sequence that contains both Chinese and English, however, it tags each English word as `eng`.  
Here is an example:

In [39]:
import jieba.posseg as pseg
words = pseg.cut("我爱打篮球 The World Economic Forum is also called the Davos Forum")
for word, flag in words:
    print('%s %s' % (word, flag))

我 r
爱 v
打篮球 n
  x
The eng
  x
World eng
  x
Economic eng
  x
Forum eng
  x
is eng
  x
also eng
  x
called eng
  x
the eng
  x
Davos eng
  x
Forum eng


## 5 Tokenize: return words with position
- The input must be unicode


Default mode:

In [41]:
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限公司		 start: 6 		 end:10
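
The `start` and `end` values in this output are character offsets into the input string, with `end` exclusive, so `text[start:end]` gives back the word. The following sketch reproduces the default-mode offsets from an already segmented word list; `with_offsets` is a hypothetical helper, not part of jieba.

```python
def with_offsets(words):
    """Yield (word, start, end) with cumulative character offsets; end is exclusive."""
    start = 0
    for word in words:
        end = start + len(word)
        yield word, start, end
        start = end


text = "永和服装饰品有限公司"
tokens = list(with_offsets(["永和", "服装", "饰品", "有限公司"]))

for word, start, end in tokens:
    # Slicing the original text with the offsets recovers the word.
    print(word, start, end, text[start:end] == word)
```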


Search mode:

In [42]:
result = jieba.tokenize(u'永和服装饰品有限公司',mode='search')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))

word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限		 start: 6 		 end:8
word 公司		 start: 8 		 end:10
word 有限公司		 start: 6 		 end:10


## 6 A Whoosh Search Engine Using the jieba Package

In [4]:
# coding=utf-8
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer
import json

analyzer = ChineseAnalyzer()

# Create the schema; stored=True means the field value is stored in the index and returned with search results
schema = Schema(title=TEXT(stored=True, analyzer=analyzer), path=ID(stored=False),
                content=TEXT(stored=True, analyzer=analyzer))
indexdir = 'indexdir/'
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
ix = create_in(indexdir, schema)

# Strings must be unicode
writer = ix.writer()
writer.add_document(title=u"第一篇文档", path=u"/a",
                    content=u"这是我们增加的第一篇文档")
writer.add_document(title=u"第二篇文档", path=u"/b",
                    content=u"第二篇文档也很interesting！")
writer.commit()

searcher = ix.searcher()

results = searcher.find("title", u"文档")

# The stored fields of the first result form a dict: {'title': ..., 'content': ...}
firstdoc = results[0].fields()

jsondoc = json.dumps(firstdoc, ensure_ascii=False)

print(jsondoc) 
print(results[0].highlights("title"))
print(results[0].score)  # bm25 score

{"content": "这是我们增加的第一篇文档", "title": "第一篇文档"}
第一篇<b class="match term0">文档</b>
0.5945348918918356


# Conclusion

From this study, I learned that there are many Python packages for parsing text and performing other information retrieval tasks. For example, the jieba package is good for segmentation and keyword search on Chinese documents, while NLTK is good for parsing and other tasks on English documents. You therefore need to choose a package according to the language of your documents, or combine several packages when a document contains multiple languages.

# Resources

If you are interested in learning more about jieba, see its documentation: https://github.com/fxsjy/jieba