![Word Tokenization](http://youthvoices.net/sites/default/files/image/69585/sep/persuasive-landing-pages-words-have-power.jpg)

Source: http://youthvoices.net/discussion/will-you-1-powerful-words

# Word Tokenization
Tokenization is one of the most common pre-processing steps for text-related problems in machine learning. In this article, we will go through how to handle word tokenization and sentence tokenization using three libraries: spaCy, NLTK, and jieba (for Chinese words).

In [135]:
# Excerpt from https://en.wikipedia.org/wiki/Lexical_analysis

article = 'In computer science, lexical analysis, lexing or tokenization is the process of \
converting a sequence of characters (such as in a computer program or web page) into a \
sequence of tokens (strings with an assigned and thus identified meaning). A program that \
performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner \
is also a term for the first stage of a lexer. A lexer is generally combined with a parser, \
which together analyze the syntax of programming languages, web pages, and so forth.'

In [136]:
article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'

In [137]:
article3 = '你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛'

In [138]:
# Excerpt from https://zh.wikipedia.org/wiki/%E8%AF%8D%E6%B3%95%E5%88%86%E6%9E%90

article4 = '词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器，也叫扫描器。词法分析器一般以函数的形式存在，供语法分析器调用。'

# spaCy

In [139]:
# !pip install spacy
import spacy
print('spaCy Version: %s' % spacy.__version__)

spaCy Version: 2.3.5


In [140]:
# Install an English model release that matches spaCy 2.3.x (either command works)
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
# !python -m spacy download en_core_web_sm

In [141]:
# Load the installed model by its package name
spacy_nlp = spacy.load('en_core_web_sm')

In [142]:
print('Original Article: %s' % (article))
print()
doc = spacy_nlp(article)
tokens = [token.text for token in doc]
print(tokens)

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be

spaCy does not separate every special character:

In [143]:
print('Original Article: %s' % (article2))
print()
doc = spacy_nlp(article2)
tokens = [token.text for token in doc]
print(tokens)

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!', '@#$%^&*()_+', '0123456']


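spaCy first splits the text on whitespace and then applies rules to each chunk, such as tokenizer exceptions, prefixes, and suffixes. In the output above, "#" is split off as a suffix and "!" as a prefix, while the remaining special characters stay together.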
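The whitespace-then-rules idea can be sketched in plain Python. This is only a simplified illustration, not spaCy's implementation: `PREFIXES`, `SUFFIXES`, `EXCEPTIONS`, `split_chunk`, and `sketch_tokenize` are hypothetical names, the rule sets are tiny and were picked to mirror the output above, and the real tokenizer has far larger rule lists and also handles infixes.

```python
import re

# Hypothetical, heavily reduced rule sets (assumptions for illustration only)
PREFIXES = re.compile(r'^[(\[{!]')       # characters peeled off the front
SUFFIXES = re.compile(r'[)\]},.!?#]$')   # characters peeled off the end
EXCEPTIONS = {"don't": ["do", "n't"]}    # special cases with a fixed split


def split_chunk(chunk):
    """Peel prefixes and suffixes off one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk:
        if chunk in EXCEPTIONS:
            break
        match = PREFIXES.search(chunk)
        if match:
            prefixes.append(match.group())
            chunk = chunk[match.end():]
            continue
        match = SUFFIXES.search(chunk)
        if match:
            suffixes.append(match.group())
            chunk = chunk[:match.start()]
            continue
        break  # no rule applies any more
    middle = EXCEPTIONS.get(chunk, [chunk] if chunk else [])
    # Suffixes were collected from the outside in, so reverse them
    return prefixes + middle + suffixes[::-1]


def sketch_tokenize(text):
    """Step 1: split on whitespace. Step 2: apply the rules per chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(sketch_tokenize("(such as tokens). don't"))
# ['(', 'such', 'as', 'tokens', ')', '.', 'do', "n't"]
```

With these reduced rules, `ConcateSepcialCharacter_!@#` loses only its trailing "#" because "@" is not in the suffix set, which is the same shape of result as the spaCy output above.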

# NLTK

In [144]:
import nltk
#nltk.download('punkt')
print('NLTK Version: %s' % nltk.__version__)

NLTK Version: 3.4.4


In [145]:
print('Original Article: %s' % (article))
print()
print(nltk.word_tokenize(article))

Original Article: In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer,[1] or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

['In', 'computer', 'science', ',', 'lexical', 'analysis', ',', 'lexing', 'or', 'tokenization', 'is', 'the', 'process', 'of', 'converting', 'a', 'sequence', 'of', 'characters', '(', 'such', 'as', 'in', 'a', 'computer', 'program', 'or', 'web', 'page', ')', 'into', 'a', 'sequence', 'of', 'tokens', '(', 'strings', 'with', 'an', 'assigned', 'and', 'thus', 'identified', 'meaning', ')', '.', 'A', 'program', 'that', 'performs', 'lexical', 'analysis', 'may', 'be

Some special characters (e.g. _) will not be separated:

In [147]:
print('Original Article: %s' % (article2))
print()
print(nltk.word_tokenize(article2))

Original Article: ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456

['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']


The behavior differs slightly from spaCy. NLTK treats most special characters as separate "words", except "_": both "ConcateSepcialCharacter_" and "_+" stay intact in the output above. Numbers are kept as tokens as well.
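To make that behavior concrete, the NLTK output above can be reproduced with a single regular expression. This is a rough sketch only: the character set in `TOKEN_PATTERN` is an assumption picked to match that one output, not NLTK's actual rule list, and `sketch_word_tokenize` is an illustrative name.

```python
import re

# Either one of the listed special characters on its own,
# or a run of anything that is neither whitespace nor one of them.
# "_" and "+" are deliberately absent from the set, so they stay attached.
TOKEN_PATTERN = re.compile(r'[!@#$%^&*()]|[^\s!@#$%^&*()]+')


def sketch_word_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


article2 = 'ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456'
print(sketch_word_tokenize(article2))
# ['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#',
#  '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']
```

Compared with the spaCy result, every listed special character becomes its own token here, while "_" and "+" remain glued to their neighbors.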

# jieba

In [148]:
#!pip install jieba
import jieba
print('jieba Version: %s' % jieba.__version__)

jieba Version: 0.42.1


In [149]:
print('Original Article: %s' % (article3))
print()

words = jieba.cut(article3, cut_all=False)
words = [str(word) for word in words]
print(words)

Original Article: 你的姿態 你的青睞 我存在在你的存在 你以為愛 就是被愛

['你', '的', '姿態', ' ', '你', '的', '青睞', ' ', '我', '存在', '在', '你', '的', '存在', ' ', '你', '以', '為', '愛', ' ', '就是', '被', '愛']


In [150]:
print('Original Article: %s' % (article4))
print()

words = jieba.cut(article4, cut_all=False)
words = [str(word) for word in words]
print(words)

Original Article: 词法分析是计算机科学中将字符序列转换为标记序列的过程。进行词法分析的程序或者函数叫作词法分析器，也叫扫描器。词法分析器一般以函数的形式存在，供语法分析器调用。

['词法', '分析', '是', '计算机科学', '中将', '字符', '序列', '转换', '为', '标记', '序列', '的', '过程', '。', '进行', '词法', '分析', '的', '程序', '或者', '函数', '叫作', '词法', '分析器', '，', '也', '叫', '扫描器', '。', '词法', '分析器', '一般', '以', '函数', '的', '形式', '存在', '，', '供', '语法分析', '器', '调用', '。']
