# ECO6128 Tutorial - Jieba

"Jieba" (Chinese for "to stutter") is a Chinese text segmentation library that aims to be the best Python module for Chinese word segmentation.

## Features

- Supports three segmentation modes:

1. Accurate Mode attempts to cut the sentence into the most accurate segmentation, which makes it suitable for text analysis.

2. Full Mode returns every possible word found in the sentence. It is fast but does not resolve ambiguity.

3. Search Engine Mode builds on Accurate Mode and further cuts long words into shorter ones, which raises the recall rate. It is suitable for search engines.

- Supports Traditional Chinese

- Supports customized dictionaries

- MIT License

Created by *Xinghao YU*, March 18th, 2023. For more details, please refer to the [JIEBA](https://github.com/fxsjy/jieba/blob/master/README.md) README.

*Copyright@Chinese University of Hong Kong, Shenzhen*

## Installation

In [None]:
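# Run only ONE of the two commands below.

# Option 1: Anaconda environment, assuming the conda-forge channel is reachable
# (-y skips the confirmation prompt, which a notebook cell cannot answer)
!conda install -y -c conda-forge jieba

# Option 2: default Python environment
!pip install jieba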

## Segmentation

### Introduction to the cut function

*cut(sentence, cut_all=False, HMM=True)* returns a generator that yields the segmented words one at a time.

*lcut(sentence)* accepts the same arguments and returns the words as a list.

In [None]:
import jieba

sentence = '我爱自然语言处理'

# Create a generator object (Tokenizer.cut)
generator = jieba.cut(sentence)

# Consume the generator and print the segmentation result
words = '/'.join(generator)
print(words)

# lcut returns the words as a list directly
print(jieba.lcut('我爱香港中文大学（深圳）'))

### Different modes

In [None]:
sentence = '订单数据分析'

print('Accurate mode:', jieba.lcut(sentence))
print('Full mode:', jieba.lcut(sentence, cut_all=True))
print('Search engine mode:', jieba.lcut_for_search(sentence))
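
To make the difference between Full Mode and Search Engine Mode more concrete, here is a simplified, standalone sketch of the two ideas. It does not call jieba: the `TOY_VOCAB` set and both helper functions are hypothetical stand-ins for jieba's dictionary and internals, and the real library handles many more cases (unknown words, non-Chinese text, and so on).

```python
# Toy vocabulary standing in for jieba's dictionary (an assumption for illustration).
TOY_VOCAB = {"订单", "单数", "数据", "分析", "数据分析"}


def full_mode_candidates(sentence, vocab):
    """List every vocabulary word found at any position, overlaps included."""
    found = []
    for start in range(len(sentence)):
        for end in range(start + 2, len(sentence) + 1):
            if sentence[start:end] in vocab:
                found.append(sentence[start:end])
    return found


def search_mode_expand(words, vocab):
    """For each long word, emit its in-vocabulary 2- and 3-character pieces, then the word."""
    result = []
    for word in words:
        for size in (2, 3):
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    piece = word[i:i + size]
                    if piece in vocab:
                        result.append(piece)
        result.append(word)
    return result


# Full-mode idea: overlapping candidates such as 单数 are all reported.
print(full_mode_candidates("订单数据分析", TOY_VOCAB))

# Search-engine idea: start from an accurate-mode style split (hard-coded here)
# and add the shorter words hidden inside 数据分析.
print(search_mode_expand(["订单", "数据分析"], TOY_VOCAB))
```

The first helper keeps overlapping words, which is why Full Mode output cannot be joined back into the original sentence. The second helper only adds shorter pieces on top of an existing split, which is how Search Engine Mode raises recall without discarding the accurate segmentation.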

### Part of Speech Tagging

After segmentation, jieba can tag each word with its part of speech (POS), using labels compatible with ictclas. *jieba.posseg.dt* is the default POSTokenizer. *jieba.posseg.POSTokenizer(tokenizer=None)* creates a custom one, where *tokenizer* specifies the *jieba.Tokenizer* to use internally.

In [None]:
import jieba.posseg as pseg

sentence = '我爱Python数据分析'
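# pseg.cut yields pair objects that carry a word and its POS flag
posseg = pseg.cut(sentence)
for pair in posseg:
    print(pair.__dict__)         # the pair's attributes as a dict
    print(pair.word, pair.flag)  # the word and its POS label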

### Tokenize: return words with position

In [None]:
# tokenize yields (word, start, end) tuples; offsets are character positions
result = jieba.tokenize('永和服装饰品有限公司')

for word, start, end in result:
    print("word %s\t start: %d \t end: %d" % (word, start, end))
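
The relationship between a segmentation and the positions reported by `tokenize` can be sketched without jieba. The helper below is a hypothetical illustration, not part of the library: it assumes the words are consecutive pieces of the sentence and derives each `(word, start, end)` tuple from the running character count, which is how the default tokenize mode behaves as far as I understand it.

```python
def word_spans(words):
    """Turn a list of consecutive words into (word, start, end) tuples."""
    spans = []
    start = 0
    for word in words:
        end = start + len(word)  # end is exclusive, like a Python slice
        spans.append((word, start, end))
        start = end
    return spans


sentence = "永和服装饰品有限公司"
# Hypothetical segmentation, hard-coded so the sketch runs without jieba.
words = ["永和", "服装", "饰品", "有限公司"]

for word, start, end in word_spans(words):
    # Each span slices the original sentence back to the word itself.
    assert sentence[start:end] == word
    print(word, start, end)
```

Because `end` is exclusive, `sentence[start:end]` recovers each word directly, which is handy for highlighting matches in the original text.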

## Modify Dictionary

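Add a word to the default dictionary. *freq* and *tag* are optional; when *freq* is omitted, jieba computes a frequency high enough for the word to be segmented out.

    add_word(word, freq=None, tag=None)

Remove a specific word from the dictionary. This is equivalent to *add_word(word, freq=0)*.

    del_word(word)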

In [None]:
sentence = '天长地久有时尽，此恨绵绵无绝期'

# Add the word with frequency 999 and POS tag 'nz'
jieba.add_word('时尽', 999, 'nz')
print('After adding [时尽]:', jieba.lcut(sentence))

# Remove the word again
jieba.del_word('时尽')
print('After deleting [时尽]:', jieba.lcut(sentence))
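
Why the frequency matters can be illustrated with a simplified model of frequency-based segmentation: choose the split whose words have the highest combined probability. The jieba README describes using dynamic programming over word frequencies; the sketch below is a toy version of that idea, not jieba's own code. The function `best_segmentation` and the counts in `toy_freq` are made up for illustration.

```python
import math


def best_segmentation(sentence, freq):
    """Pick the split with the highest product of word probabilities (toy model)."""
    total = sum(freq.values())
    n = len(sentence)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            count = freq.get(word, 0)
            # Single characters are always allowed; unseen ones get a count of 1.
            if j == i + 1 or count > 0:
                log_prob = math.log(count or 1) - math.log(total)
                candidates.append((log_prob + route[j][0], j))
        route[i] = max(candidates)
    # Walk the stored end indices to recover the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words


# Made-up counts, chosen only to show the effect of adding and deleting a word.
toy_freq = {"有": 100, "时": 20, "尽": 10, "有时": 50}
print(best_segmentation("有时尽", toy_freq))

toy_freq["时尽"] = 999  # analogous to add_word('时尽', 999)
print(best_segmentation("有时尽", toy_freq))

toy_freq["时尽"] = 0    # analogous to del_word('时尽')
print(best_segmentation("有时尽", toy_freq))
```

With these toy counts, giving 时尽 a large count moves the best split from 有时/尽 to 有/时尽, and setting the count back to 0 restores the original split. The real library also applies an HMM for unknown words, so its output on a full sentence can differ from this simplified model.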