# Word Tokenization

In [1]:
# Excerpt from https://en.wikipedia.org/wiki/Natural_language_processing

document = "Natural language processing (NLP) is a subfield of linguistics, computer science, information \
engineering, and artificial intelligence concerned with the interactions between computers and human (natural) \
languages, in particular how to program computers to process and analyze large amounts of natural language data. \
Challenges in natural language processing frequently involve speech recognition, natural language understanding, \
and natural language generation."

document_ = "ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456"

## Using spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

More details: https://spacy.io/usage/spacy-101

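In this section, you’ll use spaCy to tokenize the two strings defined above. First, load a language model instance (the `en_core_web_sm` model must be installed beforehand, for example with `python -m spacy download en_core_web_sm`):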

In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

Here, the `nlp` object is a language model instance. Throughout this tutorial, `nlp` refers to the pipeline loaded from `en_core_web_sm`. Calling `nlp` on a string runs the pipeline and returns a `Doc` object:

In [4]:
doc = nlp(document)

In spaCy, you can collect the text of each token by iterating over the `Doc` object:

In [5]:
tokens = [token.text for token in doc]

In [7]:
print("Original Document: \n{}".format(document))
print("===" * 10)
print(tokens)

Original Document: 
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'Challenges', 'in', 'natural', 'lan

In [8]:
doc = nlp(document_)
tokens = [token.text for token in doc]
print("Original Document: \n{}".format(document_))
print("===" * 10)
print(tokens)

Original Document: 
ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456
['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!', '@#$%^&*()_+', '0123456']


spaCy's tokenizer first splits the text on whitespace. It then processes each resulting substring, checking whether it matches a tokenizer exception rule and whether a prefix, suffix, or infix can be split off.
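The whitespace-then-affix idea can be sketched in plain Python. This is a heavily simplified illustration, not spaCy's implementation: the `EXCEPTIONS` table, the use of `string.punctuation` as the only prefix/suffix set, and the function names `split_chunk` and `simple_tokenize` are all assumptions made for this example, and infix handling is left out. It will not reproduce spaCy's output on `document_`.

```python
import string

# Hypothetical, tiny rule sets for illustration only (not spaCy's real data)
EXCEPTIONS = {"don't": ["do", "n't"], "U.S.": ["U.S."]}
PUNCT = set(string.punctuation)


def split_chunk(chunk):
    """Peel punctuation off both ends of one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk:
        # An exception rule wins over any further prefix/suffix splitting
        if chunk in EXCEPTIONS:
            return prefixes + EXCEPTIONS[chunk] + suffixes[::-1]
        if chunk[0] in PUNCT:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk[-1] in PUNCT:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    middle = [chunk] if chunk else []
    # Suffixes were collected from the outside in, so reverse them
    return prefixes + middle + suffixes[::-1]


def simple_tokenize(text):
    """Step 1: split on whitespace. Step 2: split affixes off each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(simple_tokenize("Natural language processing (NLP) is fun."))
print(simple_tokenize("They (don't) live in the U.S."))
```

Checking the exception table before peeling punctuation is what keeps `U.S.` in one piece while an ordinary sentence-final period is still split off.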

## Using NLTK

NLTK (Natural Language Toolkit) provides `nltk.word_tokenize` for word-level tokenization. It relies on the Punkt tokenizer data, which has to be downloaded once with `nltk.download` before the first call.

In [9]:
import nltk

In [10]:
print("Original Document: \n{}".format(document))
print("===" * 10)
print(nltk.word_tokenize(document))

Original Document: 
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'Challenges', 'in', 'natural', 'lan

In [11]:
print("Original Document: \n{}".format(document_))
print("===" * 10)
print(nltk.word_tokenize(document_))

Original Document: 
ConcateStringAnd123 ConcateSepcialCharacter_!@# !@#$%^&*()_+ 0123456
['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']
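
To make the difference between the two results for `document_` concrete, the sketch below hard-codes the token lists printed above by spaCy and NLTK and compares them. No tokenizer is run here, and the variable names are chosen only for this illustration.

```python
# Token lists copied verbatim from the spaCy and NLTK outputs above
spacy_tokens = ['ConcateStringAnd123', 'ConcateSepcialCharacter_!@', '#', '!',
                '@#$%^&*()_+', '0123456']
nltk_tokens = ['ConcateStringAnd123', 'ConcateSepcialCharacter_', '!', '@', '#',
               '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '_+', '0123456']

# Both tokenizers only cut the text; joining the tokens back together
# shows whether they kept exactly the same characters.
same_characters = "".join(spacy_tokens) == "".join(nltk_tokens)

# Tokens produced by spaCy that never appear in the NLTK output
only_spacy = [t for t in spacy_tokens if t not in nltk_tokens]

print(len(spacy_tokens), len(nltk_tokens))  # 6 17
print(same_characters)                      # True
print(only_spacy)
```

For this input, the two libraries keep the same characters and differ only in where they cut the punctuation runs: NLTK isolates most symbols as single-character tokens, while spaCy leaves longer runs intact.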
