## Exploring Tokenization

Tokenization is one of the fundamental steps in any text-processing task. Tokens generally comprise words and numbers, but they can be extended to include punctuation marks, symbols, and, at times, emoticons. This section first shows plain `split()` and then several tokenizer classes from `nltk`.

### The simple way

The `split()` string method breaks a sentence into tokens wherever whitespace occurs. Each token carries an intrinsic meaning.

In [1]:
sentence1 = "The capital of China is Beijing"
sentence1.split()

['The', 'capital', 'of', 'China', 'is', 'Beijing']
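As a small aside on how `split()` decides where tokens end, the following sketch contrasts calling it with no argument against calling it with an explicit space separator. The sample string is made up for illustration and is not part of the original examples.

```python
# Illustrative string with a double space, a tab, a newline and a trailing space.
text = "The  capital\tof China\nis Beijing "

# With no argument, split() treats any run of whitespace as one separator
# and ignores leading/trailing whitespace.
tokens = text.split()
print(tokens)  # ['The', 'capital', 'of', 'China', 'is', 'Beijing']

# With an explicit " " separator, every single space is a boundary, so
# consecutive spaces yield empty strings and tabs/newlines stay inside tokens.
naive = text.split(" ")
print(naive)  # ['The', '', 'capital\tof', 'China\nis', 'Beijing', '']
```

This is why the examples in this section call `split()` without arguments.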

In [2]:
sentence2 = "China's capital is Beijing"
sentence2.split()

["China's", 'capital', 'is', 'Beijing']

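`sentence2` contains an apostrophe. Because `split()` only separates on whitespace, it keeps `China's` as a single token and has no way to treat the possessive `'s` separately. The same happens with the contractions `we'll` and `I'm` in the next two examples.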

In [3]:
sentence3 = "Beijing is where we'll go"
sentence3.split()

['Beijing', 'is', 'where', "we'll", 'go']

In [4]:
sentence4 = "I'm going to travel to Beijing"
sentence4.split()

["I'm", 'going', 'to', 'travel', 'to', 'Beijing']

In [5]:
sentence5 = "Most of the times umm I travel"
sentence5.split()

['Most', 'of', 'the', 'times', 'umm', 'I', 'travel']

`sentence5` contains the term `umm`. It is a single token that carries no meaning of its own, although it may represent a pause in a person's speech.

In [6]:
sentence6 = "Let's travel to Hong Kong from Beijing"
sentence6.split()

["Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing']

`sentence6` contains the term `Hong Kong`. It should be one token, but `split()` breaks it into `Hong` and `Kong`.
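One way to recover multi-word terms is to merge adjacent tokens after splitting. The sketch below is a minimal pure-Python illustration; `merge_phrases` and its phrase list are hypothetical helpers introduced here, not part of `nltk` (which ships its own `MWETokenizer` for this purpose).

```python
def merge_phrases(tokens, phrases):
    """Join adjacent tokens that form one of the known multi-word phrases.

    `phrases` is a list of tuples such as [("Hong", "Kong")].
    Hypothetical helper for illustration only.
    """
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            # Skip empty phrases so the loop always advances.
            if n and tuple(tokens[i:i + n]) == tuple(phrase):
                merged.append(" ".join(phrase))
                i += n
                break
        else:
            # No phrase starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sentence6 = "Let's travel to Hong Kong from Beijing"
merged_tokens = merge_phrases(sentence6.split(), [("Hong", "Kong")])
print(merged_tokens)
# ["Let's", 'travel', 'to', 'Hong Kong', 'from', 'Beijing']
```

The approach only works for phrases known in advance, which is the main limitation of any dictionary-based merge.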

In [7]:
sentence7 = "A friend is pursuing his M.S from Beijing"
sentence7.split()

['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']

`sentence7` contains the abbreviation `M.S`. It should be one token, and here `split()` keeps it intact because it contains no whitespace.

In [8]:
sentence8 = "Beijing is a cool place!!! :-P <3 #Awesome"
sentence8.split()

['Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome']

`sentence8` contains emoticons and a hashtag, which are quite common in social media. Note that `split()` leaves the exclamation marks attached to `place`.

### Regexp Tokenizer

The `nltk` library provides regular expression-based tokenizer functionality through `RegexpTokenizer`. It can be used to tokenize or split a sentence based on a provided regular expression.

The following example defines a regular expression that matches word-like sequences, monetary amounts, and any remaining non-whitespace sequences.

In [9]:
from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string keeps the backslashes literal
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']

The `\w+|\$[\d\.]+|\S+` regular expression allows three alternative patterns, tried from left to right at each position:
- `\w+`: `\w` matches any word character (letters, digits, and underscore; `[a-zA-Z0-9_]` for ASCII text). The `+` quantifier matches one or more times, as many times as possible.
- `\$[\d\.]+`: `\$` matches the character `$`, `\d` matches a digit, `\.` matches the character `.`, and the `+` quantifier matches one or more of these.
- `\S+`: `\S` matches any non-whitespace character, and the `+` quantifier matches one or more of them.
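To see why the order of the alternatives matters, the pattern can be tried with the standard library's `re.findall`. This is a sketch that assumes `re.findall` gives the same tokens as the tokenizer for a pattern without capture groups; the reordered pattern is an illustration added here, not part of the original example.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# Same pattern as above: word characters first, then money, then anything else.
pattern = r'\w+|\$[\d\.]+|\S+'
tokens = re.findall(pattern, s)
print(tokens)
# ['A', 'Rolex', 'watch', 'costs', 'in', 'the', 'range', 'of',
#  '$3000.0', '-', '$8000.0', 'in', 'USA', '.']

# If \S+ comes first it always wins, so the other alternatives are never used
# and the result is the same as plain whitespace splitting.
reordered = re.findall(r'\S+|\w+|\$[\d\.]+', s)
print(reordered == s.split())  # True
```

Placing the most specific alternatives before the catch-all `\S+` is what lets `USA` and the final `.` come out as separate tokens.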

### Blankline Tokenizer

The `nltk` library provides blank-line tokenizer functionality through `BlanklineTokenizer`. It tokenizes a string by treating any sequence of blank lines as a delimiter.

Blank lines are defined as lines containing no characters, except for space or tab characters.

In [10]:
from nltk.tokenize import BlanklineTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"
tokenizer = BlanklineTokenizer()
tokenizer.tokenize(s)

['A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.',
 'I want a book as well']
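The blank-line behaviour can be approximated with `re.split`. The sketch below assumes a delimiter pattern of two newlines with optional surrounding whitespace, which mirrors the definition of a blank line given above; treat it as an approximation rather than the library's exact implementation.

```python
import re

# Assumed delimiter: optional whitespace, a newline, optional whitespace,
# another newline, optional whitespace (i.e. at least one blank line).
BLANK_LINES = r'\s*\n\s*\n\s*'

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"
paragraphs = [part for part in re.split(BLANK_LINES, s) if part]
print(paragraphs)
# ['A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.',
#  'I want a book as well']

# A single newline is not a blank line, so it does not split the text;
# a line holding only spaces or tabs does count as blank.
mixed = [part for part in re.split(BLANK_LINES, "a\n \t\nb\nc") if part]
print(mixed)  # ['a', 'b\nc']
```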

### WordPunct Tokenizer

The `nltk` library provides word-punctuation tokenizer functionality through `WordPunctTokenizer`. It splits a string into sequences of word characters (letters, digits, and underscore) and sequences of other non-whitespace characters such as punctuation.

In [11]:
from nltk.tokenize import WordPunctTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n I want a book as well"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$',
 '3000',
 '.',
 '0',
 '-',
 '$',
 '8000',
 '.',
 '0',
 'in',
 'USA',
 '.',
 'I',
 'want',
 'a',
 'book',
 'as',
 'well']
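The output above is consistent with a two-alternative pattern: runs of word characters, or runs of characters that are neither word characters nor whitespace. The sketch below applies that pattern with `re.findall`; the pattern is stated here as an assumption about how the tokenizer behaves, and the second input string is made up for illustration.

```python
import re

# Assumed pattern: word-character runs, or runs of non-word, non-space characters.
WORD_PUNCT = r"\w+|[^\w\s]+"

prices = re.findall(WORD_PUNCT, "$3000.0 - $8000.0")
print(prices)
# ['$', '3000', '.', '0', '-', '$', '8000', '.', '0']

# Consecutive punctuation stays together, which also breaks emoticons apart.
social = re.findall(WORD_PUNCT, "cool place!!! :-P")
print(social)
# ['cool', 'place', '!!!', ':-', 'P']
```

This makes the trade-off visible: the pattern is simple and predictable, but it splits prices and emoticons that a reader would consider single units.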

### Tweet Tokenizer

The `TweetTokenizer` class in `nltk` splits social media text into tokens. It keeps hashtags, handles, and emoticons intact.

In [12]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

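The same class accepts the `strip_handles` and `reduce_len` parameters. With `strip_handles=True` the `@amankedia` handle is removed, and with `reduce_len=True` runs of a repeated character are shortened to three, so `Rolexxxxxxxx` becomes `Rolexxx`.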

In [13]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']
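The effect of the two parameters can be sketched with plain regular expressions. The helper names and patterns below are simplified assumptions for illustration; in particular, the handle pattern is not the one `nltk` uses.

```python
import re


def reduce_lengthening(text):
    """Shorten any run of 3+ identical characters to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def strip_handles(text):
    """Replace @handles with a space (simplified, hypothetical pattern).

    The lookbehind avoids touching e-mail-like strings such as a@b.com.
    """
    return re.sub(r"(?<![\w@])@\w+", " ", text)


s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!!"
cleaned = reduce_lengthening(strip_handles(s))
print(cleaned.split())
# ["I'm", 'going', 'to', 'buy', 'a', 'Rolexxx', 'watch!!!']
```

Note that a run of exactly three characters, such as `!!!`, is left unchanged, which matches the three `!` tokens in the output above.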