# TOKENIZATION

Natural language processing (NLP) covers a wide range of procedures, and tokenization is one of them. Its task is to split a sequence of characters into units called tokens. Tokens are usually words, numbers, or punctuation marks; sometimes they are whole sentences or morphemes (word parts). Tokenization is typically the first step in text preprocessing: before moving on to more sophisticated NLP procedures, we need to identify the words that help us interpret the meaning.
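
To make the idea of a token concrete, here is a minimal sketch in plain Python that splits a string into words, numbers, and punctuation marks and labels each unit. The pattern, the function name simple_tokens, and the sample sentence are illustrative assumptions and are not part of NLTK; real tokenizers handle many more cases.

```python
import re

# Hypothetical pattern for illustration: one named group per token kind.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>\d+)|(?P<punct>[^\w\s])"
)


def simple_tokens(text):
    """Return (token, kind) pairs; whitespace is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]


print(simple_tokens("It costs 5 dollars!"))
# [('It', 'word'), ('costs', 'word'), ('5', 'number'), ('dollars', 'word'), ('!', 'punct')]
```

Even this toy version shows the main design decision every tokenizer has to make: where one unit ends and the next begins.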



# Tokenization in NLTK

NLTK has a tokenize module that consists of several sub-modules. We will take a look at the most significant ones; the table below describes some of them. The first column contains the names of the tokenizers. To import a particular one, use from nltk.tokenize import <tokenizer>. Here are some examples of importing:



In [6]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/volkan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

| Tokenizer | Description |
| --- | --- |
| word_tokenize() | Returns word and punctuation tokens. |
| WordPunctTokenizer() | Returns tokens from a string of alphabetic or non-alphabetic characters (like integers, $, @...). |
| regexp_tokenize() | Returns tokens using standard regular expressions. |
| TreebankWordTokenizer() | Returns the tokens as in the Penn Treebank using regular expressions. |
| sent_tokenize() | Returns tokenized sentences. |

# Word tokenization

Let's take a look at an example. Imagine we have a string of three sentences:

In [1]:
text = "I have got a cat. My cat's name is C-3PO. He's golden."


Now, let's have a look at each tokenizer from the table. Don't forget to import all of them in advance.
In the example below, we pass the text variable to the word_tokenize() function:




In [7]:
print(word_tokenize(text))


['I', 'have', 'got', 'a', 'cat', '.', 'My', 'cat', "'s", 'name', 'is', 'C-3PO', '.', 'He', "'s", 'golden', '.']


The result is a list of strings (tokens). The function splits the string into words and punctuation marks. Mind the possessives and the contractions: the tokenizer turns every 's into a separate token. Of course, cat's could also reasonably be treated as a single token.

The next code snippet introduces the WordPunctTokenizer(). This tokenizer is similar to the first one, but the result is a little bit different. All the punctuation marks including dashes and apostrophes are separate tokens. Now, C-3PO, the cat's name, is split into three tokens. In this case, this behavior is not optimal.

In [11]:
wpt = nltk.WordPunctTokenizer()
print(wpt.tokenize(text))

['I', 'have', 'got', 'a', 'cat', '.', 'My', 'cat', "'", 's', 'name', 'is', 'C', '-', '3PO', '.', 'He', "'", 's', 'golden', '.']
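
The behavior of WordPunctTokenizer() can be approximated with a single regular expression. The sketch below uses only the standard re module and the pattern \w+|[^\w\s]+, which is the pattern given in the NLTK documentation for this tokenizer; the helper name word_punct_tokens is an assumption made for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokens(text):
    """Plain-Python approximation of WordPunctTokenizer().tokenize()."""
    return WORD_PUNCT.findall(text)


text = "I have got a cat. My cat's name is C-3PO. He's golden."
print(word_punct_tokens(text))
# ['I', 'have', 'got', 'a', 'cat', '.', 'My', 'cat', "'", 's', 'name', 'is', 'C', '-', '3PO', '.', 'He', "'", 's', 'golden', '.']
```

Because the hyphen and the apostrophe are not word characters, they always end a token, which explains why C-3PO and cat's are broken apart.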


In [12]:
tbw = nltk.TreebankWordTokenizer()
print(tbw.tokenize(text))

['I', 'have', 'got', 'a', 'cat.', 'My', 'cat', "'s", 'name', 'is', 'C-3PO.', 'He', "'s", 'golden', '.']


The TreebankWordTokenizer() works almost the same way as word_tokenize(). Mind the full stops: inside the text they stay attached to the preceding word, and only the final full stop becomes a separate token. In contrast, word_tokenize() separates the full stop at the end of every sentence. Moreover, 's is kept together as a single token rather than being split into an apostrophe and s, as happens with WordPunctTokenizer().

Let's now move on to the next method. The regexp_tokenize() function uses regular expressions and accepts two arguments: a string and a pattern for tokens.

In [13]:
# 1: letters only
print(nltk.regexp_tokenize(text, r"[A-Za-z]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', 'PO', 'He', 's', 'golden']

# 2: letters and digits
print(nltk.regexp_tokenize(text, r"[0-9A-Za-z]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', '3PO', 'He', 's', 'golden']

# 3: letters, digits, and apostrophes
print(nltk.regexp_tokenize(text, r"[0-9A-Za-z']+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C', '3PO', "He's", 'golden']

# 4: letters, digits, apostrophes, and hyphens
print(nltk.regexp_tokenize(text, r"[0-9A-Za-z'\-]+"))
# ['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C-3PO', "He's", 'golden']

['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', 'PO', 'He', 's', 'golden']
['I', 'have', 'got', 'a', 'cat', 'My', 'cat', 's', 'name', 'is', 'C', '3PO', 'He', 's', 'golden']
['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C', '3PO', "He's", 'golden']
['I', 'have', 'got', 'a', 'cat', 'My', "cat's", 'name', 'is', 'C-3PO', "He's", 'golden']
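
One detail is worth checking when writing character classes for token patterns: a range such as A-z covers every ASCII character between uppercase A and lowercase z, which includes [, \, ], ^, _ and the backtick. The sketch below uses only the standard re module to show the difference; the sample string is made up for illustration.

```python
import re

# "A-z" spans ASCII codes 65-122, so it also matches [ \ ] ^ _ and `.
LOOSE = re.compile(r"[A-z]+")
# Two explicit ranges match letters only.
STRICT = re.compile(r"[A-Za-z]+")

sample = "snake_case [x] ^ `y`"

loose_tokens = LOOSE.findall(sample)
strict_tokens = STRICT.findall(sample)

print(loose_tokens)   # ['snake_case', '[x]', '^', '`y`']
print(strict_tokens)  # ['snake', 'case', 'x', 'y']
```

On the example sentence about the cat both classes give the same tokens, because the text contains none of the extra characters, but on other input the results can diverge.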


# Sentence tokenization
Let's look at the sent_tokenize() function. It splits a string into sentences:


In [14]:
print(nltk.sent_tokenize(text))
# ['I have got a cat.', "My cat's name is C-3PO.", "He's golden."]

['I have got a cat.', "My cat's name is C-3PO.", "He's golden."]


In [15]:
text_2 = "Mrs. Beam lives in the U.S.A., it is her motherland. She lost about 9 kilos (20 lbs.) last year."
print(sent_tokenize(text_2))
# ['Mrs. Beam lives in the U.S.A., it is her motherland.', 'She lost about 9 kilos (20 lbs.)', 'last year.']

['Mrs. Beam lives in the U.S.A., it is her motherland.', 'She lost about 9 kilos (20 lbs.)', 'last year.']


The sent_tokenize() function knows many typical abbreviations with dots (it relies on the pre-trained Punkt model downloaded earlier), so they are not treated as the end of a sentence. Sometimes it still produces confusing results. For example, in text_2 above, .) was recognized as the end of a sentence, which is a mistake: the last item in the output, 'last year.', should belong to the previous sentence.
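
To see why abbreviations matter for sentence splitting, here is a toy sketch in plain Python. It is not how sent_tokenize() works internally: the splitting rule, the tiny ABBREVIATIONS set, and both function names are assumptions made purely for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, with dot).
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def sentences_with_abbreviations(text):
    """Same rule, but glue a piece to the next one if it ends in an abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_sentences(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            continue  # not a real sentence end, keep collecting
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text_2 = "Mrs. Beam lives in the U.S.A., it is her motherland. She lost about 9 kilos (20 lbs.) last year."
print(naive_sentences(text_2))
# ['Mrs.', 'Beam lives in the U.S.A., it is her motherland.', 'She lost about 9 kilos (20 lbs.) last year.']
print(sentences_with_abbreviations(text_2))
# ['Mrs. Beam lives in the U.S.A., it is her motherland.', 'She lost about 9 kilos (20 lbs.) last year.']
```

The naive rule breaks after Mrs., and the abbreviation check repairs it. Note that this sketch keeps (20 lbs.) last year. together only because a closing parenthesis, not a dot, precedes the space; it has no real understanding of the text.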

In [16]:
text_3 = "The plot of the film is cool!!!!!!! but the characters leave much to be desired....i don't like them."
print(sent_tokenize(text_3))
# ['The plot of the film is cool!!!!!!!', "but the characters leave much to be desired....i don't like them."]

['The plot of the film is cool!!!!!!!', "but the characters leave much to be desired....i don't like them."]

Here the run of exclamation marks is treated as a sentence boundary even though the next word starts with a lowercase letter, while the ellipsis, which is not followed by a space, is not, so the last two clauses stay in one sentence.
