# Segmenting Chinese text

Segmentation, splitting a piece of text into a sequence of words, is typically the first task in Natural Language Processing (NLP). It is difficult for Chinese because the text is written without spaces between words, and a word typically consists of one, two, or three characters. For this non-trivial task we will use a dedicated Python library called [Jieba](https://github.com/fxsjy/jieba).
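To see why missing spaces make this hard, here is a minimal sketch of one classic baseline, greedy forward maximum matching against a word list. It is not how Jieba works internally (Jieba's README describes a prefix-dictionary word graph, dynamic programming, and an HMM for unknown words); the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are all assumptions made up for this illustration.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary word that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy vocabulary, hand-picked for this one sentence.
toy_vocabulary = {"来到", "北京", "清华", "大学", "清华大学"}
segments = forward_max_match("我来到北京清华大学", toy_vocabulary)
print("/ ".join(segments))
```

A greedy matcher like this depends entirely on its word list and cannot weigh competing segmentations against each other, which is one reason to prefer a dedicated library.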

The code cells below use examples from [Jieba's GitHub repository](https://github.com/fxsjy/jieba) to segment Chinese text into individual words. Once the text is segmented, you should be able to use many of the NLP tools provided in the [NLTK library](http://www.nltk.org/) and described in the [NLTK book](http://www.nltk.org/book/).

In [None]:
# start by importing the jieba library

import jieba

In [None]:
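# Full mode: list every dictionary word found in the sentence, overlaps included
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))

# Default (accurate) mode: pick a single, most likely segmentation
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))

# cut_all defaults to False, so this call also runs in default mode
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

# Search engine mode: additionally splits long words into shorter ones
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
print(", ".join(seg_list))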

In [None]:
# jieba.cut returns a generator, so wrap it in list() to get the tokens as a list
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print(list(seg_list))
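A plain list of tokens is the shape most NLTK tools expect as input. As a rough sketch of what comes next, the cell below counts word frequencies and builds adjacent word pairs. To keep it dependency-free it uses `collections.Counter` and `zip`; NLTK's `FreqDist` is a `Counter` subclass, so `nltk.FreqDist(words)` and `nltk.bigrams(words)` should be drop-in replacements. The token list and the punctuation set are hand-written assumptions, not real Jieba output.

```python
from collections import Counter

# Hand-written token list standing in for list(jieba.cut(...)) output.
tokens = ["我", "喜欢", "北京", "，", "我", "也", "喜欢", "上海", "。"]

# Drop punctuation tokens so they do not show up in the counts.
punctuation = set("，。！？、；：")
words = [t for t in tokens if t not in punctuation]

# Word frequencies; nltk.FreqDist(words) exposes the same most_common() method.
frequencies = Counter(words)
print(frequencies.most_common(2))

# Adjacent word pairs; nltk.bigrams(words) yields the same kind of tuples.
pairs = list(zip(words, words[1:]))
print(pairs[:3])
```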

In [None]:
# tokenize yields (word, start, end) tuples with character offsets into the input
result = jieba.tokenize('永和服装饰品有限公司')
for word, start, end in result:
    print("word %s\t\t start: %d \t\t end:%d" % (word, start, end))
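The offsets are useful when a token needs to be mapped back to its position in the original string, for example to highlight a match. The sketch below assumes the `end` offset is exclusive, as the example output in Jieba's README suggests. The `spans` list is written by hand in the same `(word, start, end)` shape as the loop above, not captured from a real `jieba.tokenize` run.

```python
text = "永和服装饰品有限公司"

# Hand-written (word, start, end) tuples in the shape that the loop above unpacks.
spans = [("永和", 0, 2), ("服装", 2, 4), ("饰品", 4, 6), ("有限公司", 6, 10)]

for word, start, end in spans:
    # With an exclusive end offset, a plain slice recovers the word.
    assert text[start:end] == word

# Wrap one token in brackets to highlight it in its original context.
word, start, end = spans[2]
highlighted = text[:start] + "[" + word + "]" + text[end:]
print(highlighted)
```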