# Building Your NLP Vocabulary


### Tokenization
To build a vocabulary, the first step is to break documents or sentences into chunks called tokens. Each token carries a semantic meaning. Tokenization is a fundamental step in any text-processing task: it is a segmentation technique that breaks larger pieces of text into smaller, meaningful ones.

In [2]:
sentence = "The capital of India is Delhi"
sentence.split()

['The', 'capital', 'of', 'India', 'is', 'Delhi']

In [3]:
sentence = "India's capital is Delhi"
sentence.split()

["India's", 'capital', 'is', 'Delhi']

In the preceding example, should the token be `India`, `Indias`, or `India's`? The `split` method only breaks on whitespace, so it has no way to handle cases such as apostrophes.
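To make the limitation concrete, here is a minimal sketch that compares whitespace splitting with a simple regular expression from the standard `re` module. The pattern is an illustrative assumption, not a library default, and it makes one particular choice (splitting the apostrophe off as its own token) among several reasonable ones.

```python
import re

sentence = "India's capital is Delhi."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()

# Illustrative pattern (an assumption, not a library default):
# a run of word characters, or any single non-space punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ["India's", 'capital', 'is', 'Delhi.']
print(regex_tokens)       # ['India', "'", 's', 'capital', 'is', 'Delhi', '.']
```

Neither result is obviously "right": the first leaves `Delhi.` with its period attached, while the second breaks `India's` into three pieces. This is the kind of decision the tokenizers below encode in different ways.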

### Different types of tokenizers
1. **Regular expression based tokenizer**
2. **Treebank tokenizer**
3. **TweetTokenizer**

In [4]:
from nltk.tokenize import RegexpTokenizer
sentence = "A Rolex watch costs in the range of $3000.0 - $8000.0 in the USA"
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(sentence)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'the',
 'USA']

The `\w+|\$[\d\.]+|\S+` regular expression allows three alternative patterns:

**First alternative:** `\w+` matches one or more word characters. For ASCII text, `\w` is equivalent to `[a-zA-Z0-9_]`. The `+` quantifier matches one or more times, as many times as possible (it is greedy). <br>
**Second alternative:** `\$[\d\.]+`. Here, `\$` matches the character `$`, `\d` matches a digit between 0 and 9, `\.` matches the character `.` (period), and `+` again acts as a quantifier matching one or more times. <br>
**Third alternative:** `\S+`. Here, `\S` matches any non-whitespace character and `+` acts the same way as in the preceding two alternatives.
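The alternatives are tried from left to right at each position, so their order matters. The following sketch uses the standard `re` module rather than NLTK, and wraps each alternative in a named group to show which one produced each token. The group names `word`, `money`, and `other` are hypothetical labels added purely for illustration.

```python
import re

# Same three alternatives, each wrapped in a named group (names are illustrative).
pattern = re.compile(r"(?P<word>\w+)|(?P<money>\$[\d\.]+)|(?P<other>\S+)")

sentence = "A Rolex watch costs in the range of $3000.0 - $8000.0 in the USA"

# Pair every token with the name of the alternative that matched it.
labelled = [(m.group(), m.lastgroup) for m in pattern.finditer(sentence)]

for token, kind in labelled:
    print(f"{token!r:12} matched by {kind}")

# Without the leading "$", the first alternative wins and the number is split.
bare = re.findall(r"\w+|\$[\d\.]+|\S+", "3000.0")
print(bare)  # ['3000', '.0']
```

The last line shows why the `$` prefix is significant here: a bare `3000.0` is consumed by `\w+` up to the period, and the remainder falls through to `\S+`.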

There are other tokenizers built on top of `RegexpTokenizer`, such as `BlanklineTokenizer`, which tokenizes a string by treating blank lines as delimiters. Blank lines are those that contain no characters other than spaces or tabs.
`WordPunctTokenizer` is another implementation built on top of `RegexpTokenizer`. It tokenizes text into a sequence of alphabetic and non-alphabetic characters using the regular expression `\w+|[^\w\s]+`.

In [6]:
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

from nltk.tokenize import BlanklineTokenizer
BlanklineTokenizer().tokenize(s)

['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.',
 'Thanks.']

In [7]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize

regexp_tokenize(s, pattern=r'\w+|\$[\d\.]+|\S+')

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [8]:
wordpunct_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3',
 '.',
 '88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [9]:
blankline_tokenize(s)

['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.',
 'Thanks.']
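The word-punctuation and blank-line behaviours shown above can be approximated with the standard `re` module alone, which helps clarify the difference between matching tokens and matching the gaps between them. This is only a sketch: the blank-line pattern below is an assumption chosen for illustration and is not taken from the NLTK source.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Token-matching style: the pattern describes the tokens themselves.
word_punct = re.findall(r"\w+|[^\w\s]+", s)

# Gap-matching style: the pattern describes the separators between tokens.
# Assumed pattern: two newlines with optional whitespace around them.
paragraphs = re.split(r"\s*\n\s*\n\s*", s)

print(word_punct[:7])  # ['Good', 'muffins', 'cost', '$', '3', '.', '88']
print(paragraphs)
```

In the first style `$3.88` is broken into four tokens, because `\w+` and `[^\w\s]+` can never match across the boundary between a symbol and a digit.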

**The Treebank tokenizer** does a great job of splitting contractions such as `doesn't` into `does` and `n't`. It also identifies periods at the ends of lines and splits them off as separate tokens. Punctuation such as commas is split off when followed by whitespace.

In [10]:
from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']
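To give a feel for how contraction splitting of this kind can work, here is a deliberately simplified sketch built on regular-expression substitutions. It is not NLTK's actual rule set: the function name `split_contractions` and both patterns are hypothetical, and the sketch ignores the currency and punctuation handling seen in the output above.

```python
import re

def split_contractions(text):
    """Insert a space before common English clitics, then split on whitespace.

    Simplified illustration only; not the rules used by TreebankWordTokenizer.
    """
    # "doesn't" -> "does n't"
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # "I'm" -> "I 'm", "we'll" -> "we 'll", and so on
    text = re.sub(r"\b(\w+)('(?:m|s|re|ve|ll|d))\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    return text.split()

tokens = split_contractions("I'm sure it doesn't cost much")
print(tokens)  # ['I', "'m", 'sure', 'it', 'does', "n't", 'cost', 'much']
```

Keeping `n't` and `'m` as separate tokens lets later processing steps treat the negation or the verb as its own unit instead of as part of a longer surface form.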

The growth of social media has given rise to an informal language in which people tag each other using their social media handles and use many emoticons, hashtags, and abbreviations to express themselves. We need tokenizers that can parse such text and keep these elements intact. **TweetTokenizer** is designed for this use case.

In [11]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

In [12]:
# Strip @handles and shorten elongated words such as "Rolexxxxxxxx"
tokenizer = TweetTokenizer(strip_handles=True,
                           reduce_len = True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

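The parameter `strip_handles`, when set to `True`, removes the handles mentioned in a post/tweet. As can be seen in the preceding output, `@amankedia` is stripped, since it is a handle. The parameter `reduce_len`, when set to `True`, shortens runs of three or more repeated characters to exactly three, which is why `Rolexxxxxxxx` becomes `Rolexxx`.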

One more parameter available with `TweetTokenizer` is `preserve_case`, which, when set to `False`, converts everything to lowercase in order to normalize the vocabulary. The default value for this parameter is `True`.
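The effect of these three options can be imitated with a few lines of standard-library Python. The sketch below is an approximation under stated assumptions, not NLTK's implementation: the function `normalize_tweet` is hypothetical, the handle pattern `@\w+` is a simplification, and the function splits on whitespace only, so punctuation stays attached to words.

```python
import re

def normalize_tweet(text, strip_handles=True, reduce_len=True, preserve_case=False):
    """Rough imitation of three TweetTokenizer options (illustrative only)."""
    if strip_handles:
        # Simplified handle pattern: "@" followed by word characters.
        text = re.sub(r"@\w+", "", text)
    if reduce_len:
        # Collapse runs of 3 or more identical characters down to exactly 3.
        text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)
    if not preserve_case:
        # Lowercases everything, including emoticons such as ":-D".
        text = text.lower()
    return text.split()

s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
result = normalize_tweet(s)
print(result)
```

With the defaults above, `Rolexxxxxxxx` becomes `rolexxx` and the handle disappears, which gives a vocabulary with fewer near-duplicate entries.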