# Word Tokenization Techniques in NLP

## What is word tokenization?
Tokenization is the process of breaking raw text into smaller units, such as words or sentences, called **tokens**. Tokens are the basic input for building NLP models.
Analyzing the sequence of tokens helps in interpreting the meaning of the text.

For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, and ‘raining’.

Tokenization can split text into either words or sentences.
- Splitting text into words is called **word tokenization**; splitting it into sentences is called **sentence tokenization**.

In [2]:
# Simple Example of word tokenization
from nltk.tokenize import word_tokenize

#input a text
text = "Hello everyone. Welcome to my Jupyter Notebook"
word_tokenize(text)  #print tokenized words

['Hello', 'everyone', '.', 'Welcome', 'to', 'my', 'Jupyter', 'Notebook']

In [3]:
#import sent_tokenize from nltk library
from nltk import sent_tokenize 
text = "Hello everyone. Welcome to the NLP Subject. Mr.Suyash  and Tejas Pandey are waiting for you. They'll join you soon."
# Split the text into sentences, then tokenize each sentence into words
for t in sent_tokenize(text):
    x = word_tokenize(t)
    print(x)

['Hello', 'everyone', '.']
['Welcome', 'to', 'the', 'NLP', 'Subject', '.']
['Mr.Suyash', 'and', 'Tejas', 'Pandey', 'are', 'waiting', 'for', 'you', '.']
['They', "'ll", 'join', 'you', 'soon', '.']


#### Stop words:
Stop words are words that carry little meaning on their own (for example "is", "a", "of"), so removing them does not affect the processing of the text for the defined purpose. They are removed from the vocabulary to reduce noise and to reduce the dimension of the feature set.

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a Vineet raj Parashar,Topper of my Batch"

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# Keep only the tokens that are not stop words.
# The comparison is case-sensitive and the NLTK list is lowercase, so 'This' is kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'Vineet', 'raj', 'Parashar', ',', 'Topper', 'of', 'my', 'Batch']
['This', 'Vineet', 'raj', 'Parashar', ',', 'Topper', 'Batch']
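
The output above still contains 'This' because the stop-word check is case-sensitive. A minimal sketch of a case-insensitive filter follows. It assumes a tiny hand-written stop-word set (`STOP_WORDS`, a hypothetical stand-in for `stopwords.words('english')`) so that it runs without any NLTK data.

```python
# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"this", "is", "a", "of", "my"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ['This', 'is', 'a', 'Vineet', 'raj', 'Parashar', ',',
          'Topper', 'of', 'my', 'Batch']
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive.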


### Various Tokenization Techniques:

## 1) Whitespace Tokenization
This is the simplest tokenization technique. Given a sentence or paragraph, it splits the input into words whenever whitespace is encountered. It is also the fastest technique, but it only works for languages in which whitespace separates the sentence into meaningful words.
#### Syntax : tokenize.WhitespaceTokenizer()

In [6]:
import nltk

In [7]:
# Import the WhitespaceTokenizer class from nltk
from nltk.tokenize import WhitespaceTokenizer

In [9]:
# Create a WhitespaceTokenizer instance
wtk = WhitespaceTokenizer()

#give string input
text1 = "Tejas is a best coder in juet guna"

#use tokenize method
tokens = wtk.tokenize(text1)
print(tokens)

['Tejas', 'is', 'a', 'best', 'coder', 'in', 'juet', 'guna']
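
For comparison, Python's built-in `str.split()` behaves much like a whitespace tokenizer. The short sketch below (the sample sentence is made up for illustration) also shows the main limitation of the approach: punctuation stays attached to the neighbouring word.

```python
# Splitting on whitespace only: punctuation is not separated from words.
text = "Hello everyone. Welcome, to NLP!"
tokens = text.split()
print(tokens)  # ['Hello', 'everyone.', 'Welcome,', 'to', 'NLP!']
```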


## 2) Dictionary Based Tokenization
In this method, tokens are identified by looking them up in an existing dictionary (vocabulary). If a token is not found, special rules are used to tokenize it. It is a more advanced technique than whitespace tokenization.

In [10]:
# Words to look up, and a dictionary mapping each category to a word
suy = ['Tomato', 'Orange','Almond']
kp = {'Vegetable': 'Tomato', 'Fruit': 'Orange','Dry-fruit':'Almond'}

# Invert the dictionary so that each word maps to its category
word2index = {word: category for category, word in kp.items()}

# Replace every word with its dictionary entry
tokenized = [[word2index[word] for word in text.split()] for text in suy]
tokenized

[['Vegetable'], ['Fruit'], ['Dry-fruit']]
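
One common way to tokenize against a dictionary is greedy longest-match (often called maximum matching). The sketch below is an illustration under assumptions: the function name `max_match`, the toy vocabulary, and the fallback rule (emit an unknown character as its own token) are all hypothetical choices, not part of NLTK.

```python
def max_match(text, vocabulary):
    """Greedy longest-match segmentation against a vocabulary.

    Characters that start no vocabulary entry are emitted as
    single-character tokens (the fallback rule assumed here).
    """
    tokens = []
    max_len = max(len(word) for word in vocabulary)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"new", "york", "newyork", "city", "is", "big"}
print(max_match("newyorkcityisbig", vocab))
print(max_match("bigxcity", vocab))
```

Because the longest entry wins, "newyork" is preferred over "new" followed by "york".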

## 3) Rule Based Tokenization
In this technique, a set of rules is created for the specific problem, and tokenization is done based on those rules. For example, rules can be based on the grammar of a particular language.

### Regular Expression Tokenizer
This technique uses a regular expression to control how text is split into tokens. Regular expressions range from simple to complex and can be difficult to comprehend. This technique should be preferred when the methods above do not serve the required purpose. **It is a rule-based tokenizer.**

In [11]:
# import RegexpTokenizer
from nltk.tokenize import RegexpTokenizer
  
tk = RegexpTokenizer(r"[\w']+")  # raw string; matches runs of word characters and apostrophes, so contractions stay whole

#give an input string
text = "Let's see how it's working for GUI Python"
tokens = tk.tokenize(text)
tokens

["Let's", 'see', 'how', "it's", 'working', 'for', 'GUI', 'Python']

### Punctuation-based tokenizer
Punctuation-based tokenization splits on whitespace and punctuation, and retains the punctuation marks as separate tokens. Unlike the regular expression above, which discards punctuation, it keeps every punctuation mark, so no part of the text is lost.

In [12]:
#import wordpunct_tokenize from nltk
from nltk.tokenize import wordpunct_tokenize

text = "Mrs.Surbhi buys Fruits : Mango,Banana,Orange,Cheery "

tokens = wordpunct_tokenize(text)
tokens

['Mrs',
 '.',
 'Surbhi',
 'buys',
 'Fruits',
 ':',
 'Mango',
 ',',
 'Banana',
 ',',
 'Orange',
 ',',
 'Cheery']
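
The same split can be sketched with the standard `re` module. The pattern below (runs of word characters, or runs of characters that are neither word characters nor whitespace) is assumed to mirror what `wordpunct_tokenize` does; the name `WORD_PUNCT` is only illustrative.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

text = "Mrs.Surbhi buys Fruits : Mango,Banana,Orange,Cheery "
tokens = WORD_PUNCT.findall(text)
print(tokens)
```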

### Tweet Tokenizer
Special texts such as tweets have a characteristic structure, and the generic tokenizers mentioned above fail to produce viable tokens when applied to them. NLTK offers a special tokenizer for tweets to help in this case. It is a **rule-based tokenizer** that keeps emoticons together, handles HTML entities, and can remove Twitter handles (`strip_handles=True`) and normalize text length by shortening runs of repeated letters (`reduce_len=True`).

In [13]:
#import TweetTokenizer from nltk
from nltk.tokenize import TweetTokenizer

#create object of tokenizer
tknzr = TweetTokenizer(strip_handles=True)
tweet= " @NLP_learner: NLP is way tooo coool in Reading:-) :-P <3"

x= tknzr.tokenize(tweet)
print(x)

[':', 'NLP', 'is', 'way', 'tooo', 'coool', 'in', 'Reading', ':-)', ':-P', '<3']
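
Two of these rules, removing handles and shortening repeated letters, can be approximated with plain regular expressions. This is a simplified sketch, not NLTK's implementation: the helper name `normalize_tweet` and the handle pattern are assumptions, and the sample tweet is made up.

```python
import re

def normalize_tweet(text):
    """Strip @handles and cap any run of a repeated character at three."""
    # Simplified handle pattern (assumption): '@' followed by word characters.
    text = re.sub(r"@\w+", "", text)
    # Replace 3 or more repeats of one character with exactly 3.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(normalize_tweet("@NLP_learner: NLP is sooooo coooool"))
```

Capping runs at three keeps the emphasis visible while mapping many spelling variants to one form.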


### MWE (Multi-Word Expression) Tokenizer
The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.
- MWETokenizer takes an already tokenized text (a list of tokens) and merges multi-word expressions into single tokens, using a lexicon of MWEs.


In [14]:
from nltk.tokenize import MWETokenizer
   
# Create an MWETokenizer with an initial lexicon of multi-word expressions
tk = MWETokenizer([('M', 'W', 'E'), ('Multi', 'Word', 'Tokenizer')])
tk.add_mwe(('Natural', 'Language', 'Processing'))
   
# Create a string input
text = "What is M W E in Natural Language Processing"
   
# Use tokenize method
tokenized = tk.tokenize(text.split())
   
print(tokenized)

['What', 'is', 'M_W_E', 'in', 'Natural_Language_Processing']
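
The merging step itself is easy to sketch in plain Python. The function below is a simplified illustration (the name `merge_mwes` is hypothetical and NLTK's own implementation differs); it scans the token list and joins any known expression with an underscore, preferring longer expressions.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge known multi-word expressions into single tokens."""
    # Check longer expressions first so they win over their prefixes.
    ordered = sorted(mwes, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position: keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

mwes = [("M", "W", "E"), ("Natural", "Language", "Processing")]
tokens = "What is M W E in Natural Language Processing".split()
print(merge_mwes(tokens, mwes))
```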


## 4) Penn TreeBank/Default Tokenization
A treebank is a corpus annotated with the syntactic or semantic structure of its sentences. The Penn Treebank is one of the largest published treebanks. Its tokenization convention separates punctuation and clitics (contracted forms attached to other words, such as the 'm in I'm or the n't in don't), while keeping hyphenated words together.

In [15]:
#import tokenizer from nltk
from nltk.tokenize import TreebankWordTokenizer

#create object of TreebankWordTokenizer
tk = TreebankWordTokenizer()
text = "That's True, Mr. Vineet raj parashar"
tokens = tk.tokenize(text)
tokens

['That', "'s", 'True', ',', 'Mr.', 'Vineet', 'raj', 'parashar']

## 5) Subword Tokenization
This tokenization is very useful for applications where subwords carry meaning. In this technique, the most frequently used words are given unique ids, while less frequent words are split into subwords that each carry meaning on their own. For example, if the word few appears frequently in the text, it is assigned a unique id, whereas fewer and fewest, which are rare, are split into subwords such as few, er, and est. This way the language model does not have to learn fewer and fewest as two unrelated words. It also lets the model handle words that were not seen during training by composing them from known subwords.

The main types of subword tokenization are:

- Byte-Pair Encoding (BPE)
- WordPiece
- Unigram Language Model
- SentencePiece

#### i) Byte-Pair Encoding(BPE)

BPE originated as a data-compression technique that repeatedly replaces the most frequent pair of adjacent symbols with a new symbol. As a result, frequently used words end up represented by fewer symbols, while less frequent words are represented by more symbols.
BPE is a bottom-up subword tokenization technique. The steps involved in the BPE algorithm are given below.
1. Split the words in the corpus into characters after appending the end-of-word marker `</w>`
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
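
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names and the toy word frequencies are assumptions, and ties between equally frequent pairs are broken by first occurrence.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Step 3: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Step 4: merge every occurrence of the pair into one symbol."""
    merged = "".join(pair)
    # Match the pair only on whole-symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Steps 1-2: split words into characters and append the </w> marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):          # Step 6: repeat
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)              # Step 5: save the best pair
    return merges, vocab

# Hypothetical toy corpus: word -> frequency
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
```

After three merges the frequent suffix "est" followed by the end-of-word marker has become a single symbol, while the rarer words are still split into characters.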

#### ii) WordPiece

WordPiece is similar to BPE except for the way a new token is added to the vocabulary. BPE merges the most frequently occurring pair of symbols. WordPiece also takes the frequency of the individual symbols into account and merges based on the count below.
- Count(x, y) = frequency(x, y) / (frequency(x) * frequency(y))
- The pair of symbols with the maximum count is merged into the vocabulary. Compared to BPE, this favours pairs whose individual symbols are rare, so rarer tokens can be included in the vocabulary.
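
A small sketch makes the difference from BPE concrete. It assumes a hypothetical toy corpus of pre-split words with frequencies and omits details of real WordPiece implementations (such as the `##` prefix on continuation pieces); the function name is illustrative.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Return pair frequencies and the score
    freq(x, y) / (freq(x) * freq(y)) for every adjacent pair."""
    symbol_freq, pair_freq = Counter(), Counter()
    for symbols, freq in corpus:
        for s in symbols:
            symbol_freq[s] += freq
        for x, y in zip(symbols, symbols[1:]):
            pair_freq[(x, y)] += freq
    scores = {
        (x, y): count / (symbol_freq[x] * symbol_freq[y])
        for (x, y), count in pair_freq.items()
    }
    return pair_freq, scores

# Hypothetical toy corpus: (symbols of a word, word frequency)
corpus = [
    (["h", "u", "g"], 10),
    (["p", "u", "g"], 5),
    (["p", "u", "n"], 12),
    (["b", "u", "n"], 4),
    (["h", "u", "g", "s"], 5),
]
pair_freq, scores = wordpiece_scores(corpus)
bpe_choice = max(pair_freq, key=pair_freq.get)    # most frequent pair
wordpiece_choice = max(scores, key=scores.get)    # highest score
print(bpe_choice, wordpiece_choice)
```

Here the most frequent pair is ('u', 'g'), but the highest score goes to ('g', 's') because 's' is rare on its own.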

#### iii) Unigram Language Model

In contrast to BPE or WordPiece, Unigram initializes its base vocabulary with a large number of symbols and progressively removes symbols to obtain a smaller vocabulary. The base vocabulary could, for instance, correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly by any of the models in the Transformers library, but it is used in conjunction with SentencePiece.

#### iv) SentencePiece

SentencePiece is a simple, language-independent subword tokenizer and detokenizer for neural text processing. It treats the input as a raw stream, so the space is included in the set of characters to use. It then uses the BPE or unigram algorithm to construct the appropriate vocabulary.
- All tokenization algorithms described so far have the same problem: they assume that the input text uses spaces to separate words. However, not all languages use spaces to separate words. SentencePiece was introduced to solve this problem more generally.
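
The idea of treating the space as an ordinary symbol can be illustrated without the SentencePiece library. In this sketch the helper names and the example pieces are hypothetical; only the visible space marker "▁" (U+2581) follows SentencePiece's convention.

```python
SPACE_SYMBOL = "\u2581"  # the visible marker used in place of a space

def to_symbols(text):
    """Turn raw text into symbols, keeping spaces as a visible marker."""
    # A marker is also prepended so the first word looks like any other word.
    return list(SPACE_SYMBOL + text.replace(" ", SPACE_SYMBOL))

def detokenize(pieces):
    """Join pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(SPACE_SYMBOL, " ")[1:]

symbols = to_symbols("Hello world")
# Hypothetical subword pieces that a trained model might produce.
pieces = [SPACE_SYMBOL + "Hello", SPACE_SYMBOL + "wor", "ld"]
print(symbols)
print(detokenize(pieces))  # Hello world
```

Because the spaces are part of the symbols, the original text can be restored exactly from the pieces, with no language-specific rules.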

# What is Chunking in NLP?
Chunking in NLP is the process of grouping small pieces of information, such as individual part-of-speech-tagged tokens, into larger units such as noun phrases.

In [18]:
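from nltk import pos_tag
from nltk import RegexpParser

# Split the sentence on whitespace and tag each word with its part of speech
text = "learn DSA from SDE sheet of love babbar and make study easy".split()
print("After Split:", text)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

# Chunk rule: any number of nouns, past-tense verbs and adjectives (in that order),
# followed by an optional conjunction
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)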

After Split: ['learn', 'DSA', 'from', 'SDE', 'sheet', 'of', 'love', 'babbar', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('DSA', 'NNP'), ('from', 'IN'), ('SDE', 'NNP'), ('sheet', 'NN'), ('of', 'IN'), ('love', 'NN'), ('babbar', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk DSA/NNP)
  from/IN
  (mychunk SDE/NNP sheet/NN)
  of/IN
  (mychunk love/NN babbar/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))


In [19]:
import nltk
text = "learn DSA from SDE sheet of love babbar "
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
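# Noun phrase: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()  # opens a window showing the parse tree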

['learn', 'DSA', 'from', 'SDE', 'sheet', 'of', 'love', 'babbar']
[('learn', 'JJ'), ('DSA', 'NNP'), ('from', 'IN'), ('SDE', 'NNP'), ('sheet', 'NN'), ('of', 'IN'), ('love', 'NN'), ('babbar', 'NN')]
(S
  learn/JJ
  DSA/NNP
  from/IN
  SDE/NNP
  (NP sheet/NN)
  of/IN
  (NP love/NN)
  (NP babbar/NN))
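
The grouping idea behind chunking can also be sketched without NLTK. The function below is a simplified illustration (its name and its rule, grouping consecutive tokens whose tag starts with 'NN', are assumptions and much cruder than a `RegexpParser` grammar); it reuses the tagged tokens printed above.

```python
def chunk_nouns(tagged):
    """Group maximal runs of noun-tagged tokens (tags starting with 'NN')."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            # A non-noun token closes the chunk that was being built.
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

tagged = [('learn', 'JJ'), ('DSA', 'NNP'), ('from', 'IN'), ('SDE', 'NNP'),
          ('sheet', 'NN'), ('of', 'IN'), ('love', 'NN'), ('babbar', 'NN')]
print(chunk_nouns(tagged))
```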

