## Tokenization

Text preprocessing is one of the most important first steps in gaining insights from text. It can also be one of the most tedious and time-consuming steps if we do not use the right tools and techniques.

Python has a wide range of tools that make this step easy to follow. In this notebook we look at several ways to do tokenization. Tokenization is the process of splitting a string into smaller chunks called tokens. These chunks can be anything from words to sentences, depending on your use case and on how you plan to preprocess the data for text mining.

There are multiple ways to perform word tokenization. Here we look at a few examples.

### Basic Tokenizer

In [6]:
# Simplest word tokenization: split the text or document on spaces
text = "State delegate equivalents are calculated from the results of the second alignment at each caucus location."
text.split(' ')

['State',
 'delegate',
 'equivalents',
 'are',
 'calculated',
 'from',
 'the',
 'results',
 'of',
 'the',
 'second',
 'alignment',
 'at',
 'each',
 'caucus',
 'location.']
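
A caveat with `split(' ')`: it splits on every single space character, so repeated spaces produce empty strings, and tabs or newlines are not treated as separators. A small sketch on a made-up string (not taken from the text above) shows how calling `split()` with no argument behaves differently:

```python
# Hypothetical sample with a double space, a tab and a newline
messy = "State  delegate\tequivalents\nare calculated."

by_space = messy.split(' ')    # splits on each single space only
by_whitespace = messy.split()  # splits on any run of whitespace

print(by_space)
print(by_whitespace)
```

`by_space` contains an empty string and one token that still holds the tab and the newline, while `by_whitespace` yields five clean tokens.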

### NLTK word tokenizer

In [12]:
import nltk
# nltk.download('punkt')  # run once; newer NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

# NLTK word tokenizer: a cleaner way of tokenizing the text
text = "State delegate equivalents are calculated from the results of the second alignment at each caucus location."
word_tokenize(text)

['State',
 'delegate',
 'equivalents',
 'are',
 'calculated',
 'from',
 'the',
 'results',
 'of',
 'the',
 'second',
 'alignment',
 'at',
 'each',
 'caucus',
 'location',
 '.']

The NLTK word tokenizer splits words and special characters into separate tokens, which makes further processing easier. But it does not work well for tweets, where we do not want hashtags to be broken apart.

<b>Example:</b>


In [61]:
tweet = "#DelhiElections results will be help this week and results will be next week @India"

word_tokenize(tweet)

['#',
 'DelhiElections',
 'results',
 'will',
 'be',
 'help',
 'this',
 'week',
 'and',
 'results',
 'will',
 'be',
 'next',
 'week',
 '@',
 'India']

We do not want a hashtag to be split into two tokens, with the '#' on its own; that makes little sense when analyzing tweets. So how do we tokenize tweets? NLTK has a tokenizer built specifically for them.

### NLTK Tweet tokenizer

The tweet tokenizer keeps hashtags and user handles intact.

In [62]:
from nltk.tokenize import TweetTokenizer

tknz = TweetTokenizer()
tknz.tokenize(tweet)

['#DelhiElections',
 'results',
 'will',
 'be',
 'help',
 'this',
 'week',
 'and',
 'results',
 'will',
 'be',
 'next',
 'week',
 '@India']
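
The idea of keeping a leading `#` or `@` attached to its word can be illustrated with a plain regular expression. This is only a rough sketch using the standard `re` module; it is not the pattern `TweetTokenizer` uses internally, and it ignores things like URLs and emoticons:

```python
import re

tweet = "#DelhiElections results will be help this week and results will be next week @India"

# Words and punctuation as separate tokens: '#' and '@' get split off
split_symbols = re.findall(r"\w+|[^\w\s]", tweet)

# Allow an optional '#' or '@' prefix so hashtags and handles stay whole
keep_symbols = re.findall(r"[#@]?\w+|[^\w\s]", tweet)

print(split_symbols[:2])                  # ['#', 'DelhiElections']
print(keep_symbols[0], keep_symbols[-1])  # #DelhiElections @India
```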

### Tokenizing using Regular Expression

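How do we tokenize the tweet and extract only the tokens that start with a letter, dropping symbols such as '#' and '@'?

In [63]:
from nltk.tokenize import RegexpTokenizer

# Raw string so that \w reaches the regex engine unchanged
RegexpTokenizer(r'[a-zA-Z]\w+').tokenize(tweet)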

['DelhiElections',
 'results',
 'will',
 'be',
 'help',
 'this',
 'week',
 'and',
 'results',
 'will',
 'be',
 'next',
 'week',
 'India']
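
Two details of the pattern `[a-zA-Z]\w+` are worth keeping in mind: `\w` matches digits and underscores as well as letters, and the pattern needs at least two characters. The sketch below uses `re.findall` on a made-up string to show the effect, and contrasts it with `[a-zA-Z]+`, which keeps letters only. It assumes that `RegexpTokenizer` in its default mode returns the same matches as `re.findall` for a pattern without groups:

```python
import re

# Hypothetical sample mixing letters, digits and an underscore
sample = "a B2B deal in 2020 by user_name"

# A letter followed by one or more word characters (letters, digits, underscore)
mixed = re.findall(r"[a-zA-Z]\w+", sample)

# Runs of letters only
letters_only = re.findall(r"[a-zA-Z]+", sample)

print(mixed)
print(letters_only)
```

Here `mixed` skips the single-letter word 'a' but keeps 'B2B' and 'user_name', while `letters_only` keeps 'a' and breaks 'B2B' and 'user_name' into separate pieces.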

How do we extract only the hashtags using a regular expression?

In [64]:
# Match '#' anywhere in the text (a leading ^ would only match at the start of a line)
RegexpTokenizer(r'#[a-zA-Z]\w+').tokenize(tweet)

['#DelhiElections']
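
A note on anchors: a `^` at the start of a pattern restricts matches to the beginning of the text (or of a line, in multiline mode), so hashtags in the middle of a tweet would be missed. A short sketch with `re.findall` on a made-up tweet:

```python
import re

# Hypothetical tweet whose hashtags are not at the start
other_tweet = "Results will be out next week #DelhiElections #India"

# Anchored: only a hashtag at the start of a line can match
anchored = re.findall(r"^#[a-zA-Z]\w+", other_tweet, flags=re.MULTILINE)

# Unanchored: hashtags anywhere in the text are found
unanchored = re.findall(r"#[a-zA-Z]\w+", other_tweet)

print(anchored)    # []
print(unanchored)  # ['#DelhiElections', '#India']
```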

How do we extract only the handles?

In [65]:
RegexpTokenizer(r'@[a-zA-Z]\w+').tokenize(tweet)

['@India']

### Sentence Tokenization

`sent_tokenize` from NLTK splits a document into sentences rather than words.

In [47]:
from nltk.tokenize import sent_tokenize
sentences = "Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages."
sent_tokenize(sentences)

['Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing.',
 'The same words in a different order can mean something completely different.',
 'Even splitting text into useful word-like units can be difficult in many languages.']
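
To see why a dedicated sentence tokenizer is useful, consider a naive approach that splits after every '.', '!' or '?' followed by whitespace. The sketch below uses only the `re` module and a made-up piece of text; it is not how `sent_tokenize` works, and it breaks on abbreviations:

```python
import re

# Hypothetical text with two sentences and two abbreviations
doc = "Dr. Smith arrived at 5 p.m. today. He left early."

# Split on whitespace that follows sentence-ending punctuation
naive_sentences = re.split(r"(?<=[.!?])\s+", doc)

print(naive_sentences)
# ['Dr.', 'Smith arrived at 5 p.m.', 'today.', 'He left early.']
```

The text has two sentences, but the naive split returns four pieces. `sent_tokenize` relies on a pre-trained Punkt model that is designed to recognize many such abbreviations.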

### Removing Stop words and Special Characters

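Stop words are very common words, such as "the", "is", and "of", that appear in most sentences of a language and usually carry little meaning on their own. Special characters such as punctuation can be filtered out in the same pass.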

In [83]:
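# nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the stop word set once instead of once per token
stop_words = set(stopwords.words('english'))

# Keep alphabetic tokens that are not stop words (the lookup is case-sensitive)
[w for w in word_tokenize(sentences) if w not in stop_words and w.isalpha()]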

['Processing',
 'raw',
 'text',
 'intelligently',
 'difficult',
 'words',
 'rare',
 'common',
 'words',
 'look',
 'completely',
 'different',
 'mean',
 'almost',
 'thing',
 'The',
 'words',
 'different',
 'order',
 'mean',
 'something',
 'completely',
 'different',
 'Even',
 'splitting',
 'text',
 'useful',
 'units',
 'difficult',
 'many',
 'languages']
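
The filter above compares each token to the stop word list exactly as written, and NLTK's English stop words are lowercase, which is why 'The' survives in the output. If that is not wanted, one option is to lowercase each token before the lookup. The sketch below uses a tiny hand-written stop word set as a stand-in for `stopwords.words('english')`, so it runs without any downloads:

```python
# Hypothetical, tiny stand-in for NLTK's English stop word list
stop_words = {"the", "same", "in", "a", "can"}

tokens = ["The", "same", "words", "in", "a", "different", "order"]

# Case-sensitive lookup: 'The' is not in the set, so it is kept
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercase only for the lookup, so kept tokens retain their original casing
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['The', 'words', 'different', 'order']
print(case_insensitive)  # ['words', 'different', 'order']
```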

In this notebook we have seen several ways to tokenize text documents: splitting on spaces, NLTK's word, tweet, regular expression, and sentence tokenizers, and filtering out stop words and special characters.