## NLP basics

There are different ways to preprocess text:

* **stop word removal**,
* **tokenization**,
* **stemming**.
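To make the three steps concrete, here is a minimal sketch that chains them using only the standard library. The stop word list, the suffix rules, and all function names are assumptions made for illustration; they are far cruder than what a library such as NLTK provides.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "is", "to", "a", "an", "and"}

def tokenize(text):
    """Lowercase the text and return runs of word characters."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Require a stem of at least three characters to avoid over-stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The sole meaning of life is to serve humanity"))
# ['sole', 'mean', 'life', 'serve', 'humanity']
```

The rest of this section focuses on the tokenization step only.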

### Tokenization

Tokenization is the process of breaking a stream of text into words, terms, sentences, symbols, or other meaningful elements called tokens.

#### NLTK Word Tokenize
NLTK (Natural Language Toolkit) is a Python library for natural language processing. The cells below use several of its tokenizers.

In [70]:
import nltk
from nltk.tokenize import (word_tokenize, sent_tokenize, wordpunct_tokenize,
                           TreebankWordTokenizer, TweetTokenizer, MWETokenizer)
nltk.download('punkt')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [71]:
text = 'The Sole meaning of life, is to serve humanity, #we.are.humans.'

#### 1. Word Tokenization

In [72]:
print(f'Word Tokenize: {word_tokenize(text)}')

Word Tokenize: ['The', 'Sole', 'meaning', 'of', 'life', ',', 'is', 'to', 'serve', 'humanity', ',', '#', 'we.are.humans', '.']


In [73]:
word_tokenize(text)

['The',
 'Sole',
 'meaning',
 'of',
 'life',
 ',',
 'is',
 'to',
 'serve',
 'humanity',
 ',',
 '#',
 'we.are.humans',
 '.']

In [74]:
# using split 
text.split()

['The',
 'Sole',
 'meaning',
 'of',
 'life,',
 'is',
 'to',
 'serve',
 'humanity,',
 '#we.are.humans.']

Here, split() leaves the punctuation marks "." and "," attached to the neighboring words. NLTK's tokenizers avoid this by emitting punctuation as separate tokens.

In [75]:
# using split with an explicit separator (a comma followed by a space)
text.split(', ')

['The Sole meaning of life', 'is to serve humanity', '#we.are.humans.']

#### 2. Sentence Tokenization

In [76]:
print(f'Sentence Tokenize: {sent_tokenize(text)}')
# N.B.: sent_tokenize uses the pre-trained Punkt model from tokenizers/punkt/english.pickle.

Sentence Tokenize: ['The Sole meaning of life, is to serve humanity, #we.are.humans.']
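To see why a pre-trained model is useful for this task, compare it with a naive rule. The sketch below, using only the standard library, splits after ".", "!" or "?" whenever whitespace follows. The function name naive_sent_tokenize is hypothetical, and this is not how sent_tokenize works internally.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Works for simple input.
print(naive_sent_tokenize("Hope is the only thing stronger than fear! Hunger games #HOPE"))
# ['Hope is the only thing stronger than fear!', 'Hunger games #HOPE']

# Breaks on abbreviations: "Dr." is wrongly treated as a sentence end.
print(naive_sent_tokenize("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```

The second call shows the kind of mistake that a simple punctuation rule makes on abbreviations.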


#### 3. Punctuation-based tokenizer

This tokenizer splits sentences into words based on whitespace and punctuation.

In [77]:
print(f'wordpunct_tokenize: {wordpunct_tokenize(text)}')

wordpunct_tokenize: ['The', 'Sole', 'meaning', 'of', 'life', ',', 'is', 'to', 'serve', 'humanity', ',', '#', 'we', '.', 'are', '.', 'humans', '.']


Notice the difference: word_tokenize keeps “we.are.humans” as a single token, whereas wordpunct_tokenize splits it at every period.
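NLTK's documentation describes wordpunct_tokenize as a regular-expression tokenizer based on the pattern `\w+|[^\w\s]+`. As a rough illustration, the same pattern can be applied directly with the standard library; the helper name regex_wordpunct is hypothetical.

```python
import re

def regex_wordpunct(text):
    """Return runs of word characters or runs of punctuation characters."""
    return re.findall(r"\w+|[^\w\s]+", text)

text = 'The Sole meaning of life, is to serve humanity, #we.are.humans.'
print(regex_wordpunct(text))
# ['The', 'Sole', 'meaning', 'of', 'life', ',', 'is', 'to', 'serve',
#  'humanity', ',', '#', 'we', '.', 'are', '.', 'humans', '.']
```

Because every period is punctuation under this pattern, "we.are.humans" cannot survive as one token.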

#### 4. Treebank Word tokenizer

It separates phrase-terminating punctuation such as (?!.;,) from adjacent tokens and retains decimal numbers as single tokens. It also contains rules for English contractions.

For example, “don’t” is tokenized as [“do”, “n’t”]. All the rules for the Treebank tokenizer are documented at:

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.treebank

In [78]:
text = "What you don't want to be done to yourself, don't do to others"
tokenizer = TreebankWordTokenizer() # tokenize() is a method, so we first create a tokenizer object
print(f'Treebank Word tokenizer: {tokenizer.tokenize(text)}')

Treebank Word tokenizer: ['What', 'you', 'do', "n't", 'want', 'to', 'be', 'done', 'to', 'yourself', ',', 'do', "n't", 'do', 'to', 'others']


In [79]:
print(f'wordpunct_tokenize: {wordpunct_tokenize(text)}')

wordpunct_tokenize: ['What', 'you', 'don', "'", 't', 'want', 'to', 'be', 'done', 'to', 'yourself', ',', 'don', "'", 't', 'do', 'to', 'others']
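The contraction handling that separates the two outputs above can be approximated with a few regular expressions. The sketch below is a heavily simplified illustration using only the standard library, not the actual Treebank rule set; the function name split_contractions and the list of suffixes are assumptions.

```python
import re

def split_contractions(text):
    """Separate some punctuation and split common English contractions."""
    # Put spaces around a few punctuation marks so they become tokens.
    text = re.sub(r"([,;:?!])", r" \1 ", text)
    # "don't" -> "do n't", "can't" -> "ca n't"
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # "you'll" -> "you 'll", "I'm" -> "I 'm"
    text = re.sub(r"\b(\w+)('ll|'re|'ve|'s|'m|'d)\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    return text.split()

text = "What you don't want to be done to yourself, don't do to others"
print(split_contractions(text))
# ['What', 'you', 'do', "n't", 'want', 'to', 'be', 'done', 'to',
#  'yourself', ',', 'do', "n't", 'do', 'to', 'others']
```

For this sentence the sketch reproduces the Treebank output, but it ignores many cases the real tokenizer covers, such as quotes, brackets, and sentence-final periods.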


#### 5. Tweet tokenizer
General-purpose tokenizers like the ones above often produce impractical tokens for text such as tweets. For this reason, NLTK provides a rule-based tokenizer designed specifically for tweets. It splits emojis into separate tokens, which is useful for tasks like sentiment analysis.

In [80]:
tweet = "Don't take cryptocurrency advice from people on twitter 😊😂"
tokenizer = TweetTokenizer() # create a tokenizer object with the default settings
print(f'Tweet tokenizer: {tokenizer.tokenize(tweet)}')

Tweet tokenizer: ["Don't", 'take', 'cryptocurrency', 'advice', 'from', 'people', 'on', 'twitter', '😊', '😂']
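To give a feel for what tweet-oriented rules look like, here is a small regular-expression sketch that keeps mentions, hashtags, and contractions intact and emits each emoji as its own token. It is an assumption-laden illustration with a hypothetical function name, not the rule set TweetTokenizer uses, and it would split emoji sequences built from several code points.

```python
import re

# Order matters: try mentions/hashtags first, then words (with optional
# apostrophe parts), then any single non-space, non-word character.
TWEET_PATTERN = re.compile(r"[@#]\w+|\w+(?:'\w+)*|[^\w\s]")

def simple_tweet_tokenize(text):
    """Tokenize a tweet with a single regular expression."""
    return TWEET_PATTERN.findall(text)

print(simple_tweet_tokenize("@user I can't wait #NLP 😊😂"))
# ['@user', 'I', "can't", 'wait', '#NLP', '😊', '😂']
```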


#### 6. MWE tokenizer

NLTK’s multi-word expression tokenizer (MWETokenizer) provides an add_mwe() method that lets the user register multi-word expressions before tokenizing the text. Put simply, it merges registered multi-word expressions into single tokens.

In [81]:
text = "Hope is the only thing stronger than fear! Hunger games #HOPE"
tokenizer = MWETokenizer()
# MWETokenizer expects a list of tokens; a raw string is processed character by character
print(f'MWET: {tokenizer.tokenize(text)}')

MWET: ['H', 'o', 'p', 'e', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'o', 'n', 'l', 'y', ' ', 't', 'h', 'i', 'n', 'g', ' ', 's', 't', 'r', 'o', 'n', 'g', 'e', 'r', ' ', 't', 'h', 'a', 'n', ' ', 'f', 'e', 'a', 'r', '!', ' ', 'H', 'u', 'n', 'g', 'e', 'r', ' ', 'g', 'a', 'm', 'e', 's', ' ', '#', 'H', 'O', 'P', 'E']


In [82]:
# split the text into words first, then pass the token list to the MWE tokenizer
print(f'MWET: {tokenizer.tokenize(word_tokenize(text))}')

MWET: ['Hope', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear', '!', 'Hunger', 'games', '#', 'HOPE']


In [83]:
tokenizer.add_mwe(('Hunger','games'))
print(f'MWET: {tokenizer.tokenize(word_tokenize(text))}')

MWET: ['Hope', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear', '!', 'Hunger_games', '#', 'HOPE']
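The merging step itself is easy to picture in plain Python. The sketch below scans a token list and joins any registered expression with an underscore, trying longer expressions first. It is a simplified stand-in with a hypothetical function name, not NLTK's implementation.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge registered multi-word expressions in a list of tokens."""
    # Ignore empty expressions and try the longest expressions first.
    expressions = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in expressions:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at position i; keep the token as is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['Hope', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear',
          '!', 'Hunger', 'games', '#', 'HOPE']
print(merge_mwes(tokens, [('Hunger', 'games')]))
# ['Hope', 'is', 'the', 'only', 'thing', 'stronger', 'than', 'fear', '!',
#  'Hunger_games', '#', 'HOPE']
```

Like MWETokenizer, the function works on a token list, which is why the text must be split into words before merging.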


#### TextBlob Word Tokenize

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Let’s start by installing TextBlob and the NLTK corpora:

    pip install -U textblob
    python3 -m textblob.download_corpora

The cells below perform word tokenization using the TextBlob library:

In [84]:
!pip install -U textblob




In [85]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /home/jovyan/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


In [86]:
from textblob import TextBlob
text = "But I am glad that you'll see me as I am. Above all, I wouldn't want people to think that I want to prove anything but I don't want."

blob_object = TextBlob(text)

In [87]:
# Word Tokenization of the text
text_words = blob_object.words

# To see all tokens
print(text_words)
# To count no. of tokens
print(len(text_words))

['But', 'I', 'am', 'glad', 'that', 'you', "'ll", 'see', 'me', 'as', 'I', 'am', 'Above', 'all', 'I', 'would', "n't", 'want', 'people', 'to', 'think', 'that', 'I', 'want', 'to', 'prove', 'anything', 'but', 'I', 'do', "n't", 'want']
32


Notice that the TextBlob tokenizer removes punctuation. It also applies rules for English contractions, splitting “you'll” into “you” and “'ll”, for example.
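A similar effect can be obtained from any tokenizer that emits punctuation as separate tokens by filtering its output afterwards. The sketch below keeps only tokens that contain at least one letter or digit. The function name is hypothetical, and this is not necessarily how TextBlob implements it.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]

tokens = ['But', 'I', 'am', 'glad', 'that', 'you', "'ll", 'see', 'me',
          'as', 'I', 'am', '.', 'Above', 'all', ',']
print(drop_punctuation(tokens))
# ['But', 'I', 'am', 'glad', 'that', 'you', "'ll", 'see', 'me', 'as',
#  'I', 'am', 'Above', 'all']
```

Contraction pieces such as "'ll" survive the filter because they contain letters, while "." and "," are dropped.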