# NATURAL LANGUAGE PROCESSING (DAY 5)


- TOOLS USED FOR TOKENIZATION
- SENTENCE TOKENIZATION 
- WORD TOKENIZATION 

# TOKENIZATION

# PART 1 SENTENCE TOKENIZATION METHODS

Using NLTK (Natural Language Toolkit)

In [1]:
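import nltk
from nltk.tokenize import sent_tokenize
# Needs the Punkt models: run nltk.download('punkt') once if they are missing
# (recent NLTK releases ask for 'punkt_tab' instead).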

In [2]:
sentence = """Tokenization is the process of breaking down a piece of text into small units called tokens. A token may be a word, part of a word or just characters like punctuation."""     

In [3]:
sent_tokenize(sentence)

['Tokenization is the process of breaking down a piece of text into small units called tokens.',
 'A token may be a word, part of a word or just characters like punctuation.']

Using re (REGEX)

In [4]:
import re
# The matched punctuation and space are consumed, so the first sentence loses its period.
re.compile(r'[.!?] ').split(sentence)

['Tokenization is the process of breaking down a piece of text into small units called tokens',
 'A token may be a word, part of a word or just characters like punctuation.']
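The split above removes the punctuation it matches, which is why the first sentence comes back without its period. A minimal sketch of one way around this, assuming sentences end in `.`, `!` or `?` followed by whitespace: a lookbehind keeps the punctuation attached to its sentence. The names `text` and `sentences` are illustrative only, and abbreviations such as "Dr." would still be split in the wrong place.

```python
import re

text = ("Tokenization is the process of breaking down a piece of text "
        "into small units called tokens. A token may be a word, part of "
        "a word or just characters like punctuation.")

# (?<=[.!?]) checks that the previous character ends a sentence without
# consuming it; \s+ then matches (and removes) only the whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```

Both sentences now keep their final period, matching the NLTK result.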

Using spaCy

In [5]:
import spacy
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp(sentence)
for sent in doc.sents:
    print(sent.text)



Tokenization is the process of breaking down a piece of text into small units called tokens.
A token may be a word, part of a word or just characters like punctuation.


Using Keras

In [6]:
from keras.preprocessing.text import text_to_word_sequence
# A word-level helper reused here with "." as the separator; output is lowercased by default.
text_to_word_sequence(sentence, split=".", filters="!.\n")

['tokenization is the process of breaking down a piece of text into small units called tokens',
 ' a token may be a word, part of a word or just characters like punctuation']

Using Gensim

In [7]:
# Requires gensim < 4.0: the summarization module was removed in gensim 4.0.
from gensim.summarization.textcleaner import split_sentences
list(split_sentences(sentence))

['Tokenization is the process of breaking down a piece of text into small units called tokens.',
 'A token may be a word, part of a word or just characters like punctuation.']

# PART 2 WORD TOKENIZATION METHODS

Using spaCy

In [8]:
doc = nlp('Tokenization is the process of breaking down a piece of text into small units called tokens')
print([token.text for token in doc])

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'a', 'piece', 'of', 'text', 'into', 'small', 'units', 'called', 'tokens']


Using NLTK

In [9]:
from nltk.tokenize import word_tokenize

word_tokenize("Tokenization is the process of breaking down a piece of text into small units called tokens")

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens']

Using TreebankWordTokenizer

In [10]:
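from nltk.tokenize import TreebankWordTokenizer

# Expects one sentence at a time: only the final period of the input is split off,
# so 'tokens.' in the middle of the text stays a single token.
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)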

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens.',
 'A',
 'token',
 'may',
 'be',
 'a',
 'word',
 ',',
 'part',
 'of',
 'a',
 'word',
 'or',
 'just',
 'characters',
 'like',
 'punctuation',
 '.']

SPLIT METHOD

In [11]:
sentence.split()

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens.',
 'A',
 'token',
 'may',
 'be',
 'a',
 'word,',
 'part',
 'of',
 'a',
 'word',
 'or',
 'just',
 'characters',
 'like',
 'punctuation.']

REGEX METHOD - REGULAR EXPRESSION TOKENIZER

NOTE:
    - Use https://regex101.com/ to try out and write your own regex patterns.

In [12]:
re.findall(r'\w+', sentence)  # raw string avoids an invalid-escape warning; test the pattern at https://regex101.com/

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens',
 'A',
 'token',
 'may',
 'be',
 'a',
 'word',
 'part',
 'of',
 'a',
 'word',
 'or',
 'just',
 'characters',
 'like',
 'punctuation']
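The `\w+` pattern drops punctuation entirely. If punctuation should survive as separate tokens, one possible sketch is to add a second alternative to the pattern. The names `text` and `tokens` are illustrative, and the pattern is an assumption about what counts as punctuation (any single character that is neither a word character nor whitespace).

```python
import re

text = ("Tokenization is the process of breaking down a piece of text "
        "into small units called tokens. A token may be a word, part of "
        "a word or just characters like punctuation.")

# \w+ matches a run of word characters; [^\w\s] matches one character that
# is neither a word character nor whitespace, i.e. a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens[14:18])  # ['called', 'tokens', '.', 'A']
```

Unlike the TreebankWordTokenizer output shown earlier, this also separates the period after "tokens".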

Using Keras

In [13]:
from keras.preprocessing.text import text_to_word_sequence
text_to_word_sequence(sentence)

['tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens',
 'a',
 'token',
 'may',
 'be',
 'a',
 'word',
 'part',
 'of',
 'a',
 'word',
 'or',
 'just',
 'characters',
 'like',
 'punctuation']

Using Gensim

In [14]:
from gensim.utils import tokenize
list(tokenize(sentence))

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'a',
 'piece',
 'of',
 'text',
 'into',
 'small',
 'units',
 'called',
 'tokens',
 'A',
 'token',
 'may',
 'be',
 'a',
 'word',
 'part',
 'of',
 'a',
 'word',
 'or',
 'just',
 'characters',
 'like',
 'punctuation']

References: https://www.analyticsvidhya.com/, https://medium.com/, https://www.geeksforgeeks.org/, https://regex101.com/

# NATURAL LANGUAGE PROCESSING (DAY 6)

- STEMMING
- LEMMATIZATION 

Stemming and lemmatization are text-normalization techniques used in Natural Language Processing (NLP) to prepare words and documents for further processing in machine learning. For example, you may want to treat the words “like” and “liked” as the same word in different tenses.
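To make the idea concrete, here is a toy sketch in plain Python. It is not the Porter algorithm or any library's lemmatizer: `naive_stem`, `naive_lemma` and the `LEMMAS` table are hypothetical helpers written only for this illustration. The stemmer chops suffixes by rule and may return a non-word, while the lemmatizer looks up a dictionary form.

```python
def naive_stem(word):
    """Toy stemmer: lowercase, then strip the first matching suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "e", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"liked": "like", "likes": "like", "liking": "like"}


def naive_lemma(word):
    """Toy lemmatizer: return the dictionary form if known, else the word."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["like", "liked", "likes", "liking"]:
    print(w, naive_stem(w), naive_lemma(w))
```

All four forms share the stem "lik" and the lemma "like", so either normalization lets later processing count them as one word.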