# 4.2 - Segmenting Chinese text

Typically the first task in Natural Language Processing (NLP) is segmentation: splitting a piece of text into a sequence of words. This is difficult in Chinese because the text is written without spaces between words, and a single word may consist of one, two, three, or more characters. To handle this non-trivial task we will use a dedicated Python library called [Jieba](https://github.com/fxsjy/jieba).
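
To make the difficulty concrete, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline for segmentation. It is not a description of how Jieba works internally; the function name `forward_max_match`, the `max_len` parameter, and the five-word `toy_vocab` are all hypothetical and chosen only for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation (illustrative only).

    At each position, take the longest substring found in `vocab`;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary; a real dictionary has hundreds of thousands of entries.
toy_vocab = {"来到", "北京", "清华", "清华大学", "大学"}

tokens = forward_max_match("我来到北京清华大学", toy_vocab)
print("/ ".join(tokens))
```

The greedy strategy succeeds here only because the toy vocabulary happens to fit the sentence. With a realistic dictionary, overlapping candidates such as 清华, 清华大学, and 大学 compete constantly, which is one reason to rely on a dedicated library rather than a hand-rolled matcher.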

The following code walks through examples from [Jieba's GitHub repository](https://github.com/fxsjy/jieba) for segmenting Chinese text into individual words. Once the text is segmented, you should be able to use many of the NLP tools provided in the [NLTK library](http://www.nltk.org/) and described in the [NLTK book](http://www.nltk.org/book/).

In [1]:
# start by importing the jieba library
import jieba

In [2]:
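# full mode: list every dictionary word found in the sentence, including overlaps
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))

# default (accurate) mode: choose a single, non-overlapping segmentation
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

# cut_all defaults to False, so this call also uses the default mode
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

# search engine mode: additionally split long words into shorter ones
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print(", ".join(seg_list))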

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 3.898 seconds.
Prefix dict has been built succesfully.


Full Mode: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学
Default Mode: 我/ 来到/ 北京/ 清华大学
他, 来到, 了, 网易, 杭研, 大厦
小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, ，, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造


In [3]:
# jieba.cut returns a generator; wrap it in list() to get a reusable list of words
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print(list(seg_list))

['我', '来到', '北京', '清华大学']
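
Because the segmented result is an ordinary list of strings, standard Python tools can be applied to it directly. The sketch below counts word frequencies with `collections.Counter` from the standard library (NLTK's `FreqDist` offers a similar interface). The two token lists are copied from the outputs above so that the snippet runs without Jieba installed; the variable names are illustrative.

```python
from collections import Counter

# Token lists copied from the default-mode outputs shown earlier.
sentence_1 = ['我', '来到', '北京', '清华大学']
sentence_2 = ['他', '来到', '了', '网易', '杭研', '大厦']

# Count how often each word occurs across both sentences.
counts = Counter(sentence_1 + sentence_2)
top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)

# Keep only multi-character words, a crude way to drop particles and pronouns.
multi_char = [word for word in counts if len(word) > 1]
print(multi_char)
```

On real corpora the same pattern applies: segment each document, concatenate or stream the tokens, and pass them to the counter.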


In [4]:
# jieba.tokenize yields (word, start, end) tuples with character offsets into the input
result = jieba.tokenize(u'永和服装饰品有限公司')
for tk in result:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))

word 永和		 start: 0 		 end:2
word 服装		 start: 2 		 end:4
word 饰品		 start: 4 		 end:6
word 有限公司		 start: 6 		 end:10
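
The offsets in this output behave like Python slice bounds: `start` is inclusive and `end` is exclusive. The short check below illustrates this using the tuples copied from the output above (so it runs without Jieba installed); the names `spans` and `covers` are illustrative.

```python
text = '永和服装饰品有限公司'

# (word, start, end) tuples copied from the tokenize output above.
spans = [('永和', 0, 2), ('服装', 2, 4), ('饰品', 4, 6), ('有限公司', 6, 10)]

# Slicing the original string with each (start, end) pair recovers the word.
recovered = [text[start:end] for _, start, end in spans]
print(recovered)

# Consecutive spans tile the text: each one begins where the previous one ended.
covers = (
    spans[0][1] == 0
    and spans[-1][2] == len(text)
    and all(prev[2] == nxt[1] for prev, nxt in zip(spans, spans[1:]))
)
print(covers)
```

Offsets like these are useful when segmentation results need to be mapped back onto the source text, for example to highlight a matched word in its original context.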
