# Introduction

![Tokenization.png](attachment:Tokenization.png)

The input to natural language processing (NLP) is text. This text is collected from many different sources, so it needs a good deal of cleaning and processing before it can be used for analysis.

The first thing you need to do in any NLP project is text preprocessing. Preprocessing input text simply means putting the data into a predictable and analyzable form. It’s a crucial step for building an amazing NLP application.

*There are different ways to preprocess text:* 

- Tokenization
- Stop words removal
- Stemming
- Normalization
- Lemmatization


### Introduction to Tokenization

Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller chunks, words or sentences, called tokens. Splitting the text into words is called 'Word Tokenization'; splitting it into sentences is called 'Sentence Tokenization'.

- What is a token?
    A sentence contains tokens, which can be words, numbers, symbols, punctuation marks, and so on. Each of them carries meaning, which is why a token is described as an element or unit of semantics.



#### Why do we need tokenization?

Tokenization is the first step in any NLP pipeline. It has an important effect on the rest of your pipeline. A tokenizer breaks unstructured data and natural language text into chunks of information that can be considered as discrete elements. The token occurrences in a document can be used directly as a vector representing that document. 

This immediately turns an unstructured string (text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or they can serve as features in a machine learning pipeline that trigger more complex decisions or behavior.
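As a minimal sketch of the idea above, token counts can be turned into a fixed-length vector over a vocabulary. The names `docs`, `vocabulary`, and `to_vector` are illustrative and not part of any library, and a lowercase whitespace split stands in for a real tokenizer.

```python
from collections import Counter

# Two toy documents (hypothetical data, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenize with a plain whitespace split and lowercase the tokens
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of counts in vocabulary order, which is the simplest form of a bag-of-words representation.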

#### Tokenization Techniques 
There are multiple ways we can perform tokenization on given text data. We can choose any method based on language, library and purpose of modeling.

## Tokenization Using Split method


In [1]:
## Word Tokenization -> python

text = """Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers."""
# Split text by whitespace
tokens = text.split()
print(tokens)

['Tokenization', 'is', 'a', 'common', 'task', 'in', 'Natural', 'Language', 'Processing', '(NLP).', 'It’s', 'a', 'fundamental', 'step', 'in', 'both', 'traditional', 'NLP', 'methods', 'like', 'Count', 'Vectorizer', 'and', 'Advanced', 'Deep', 'Learning-based', 'architectures', 'like', 'Transformers.']


*Note:* 
- Observe in the list above that tokens like 'Transformers.' and '(NLP).' still have punctuation attached.
    
- Python's split() method does not treat punctuation as a separate token.



In [2]:
### Sentence Tokenization -> python

# Let's split the given text by full stop (.)

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

text.split(". ") # Note: the space after the full stop makes sure that we don't get an empty element at the end of the list.

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 "But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."]

*Note*

- As you can see, split() accepts only one separator at a time, so it failed to split the last two sentences at the exclamation mark (!).

- We can overcome this drawback by applying split() several times with different separators, but there are better ways to do it.
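One way to handle several separators with only built-in string methods is sketched below: every sentence-ending mark is rewritten to a single placeholder separator, and the text is then split once. The helper name `split_sentences` and the placeholder character are made up for this illustration.

```python
def split_sentences(text, enders=(".", "!", "?")):
    """Split text on several sentence enders using only str methods."""
    marker = "\x00"  # placeholder assumed not to occur in the text
    for ender in enders:
        # Keep the punctuation mark, then put the marker after it
        text = text.replace(ender + " ", ender + marker)
    return [part.strip() for part in text.split(marker) if part.strip()]

text = "Periods end sentences. So do exclamation points! What about questions? Those too."
sentences = split_sentences(text)
print(sentences)
```

This works, but every new separator adds another pass over the text, which is why a pattern-based approach is usually more convenient.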



## Tokenization Using Regular Expressions (RegEx)


In [3]:
import re

text = """Tokenization is a 034 common task in @ # Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and Advanced Deep Learning-based architectures like Transformers."""

tokens = re.findall(r"\w+", text)  # raw string, so the backslash in \w is not treated as a string escape
print(tokens)

['Tokenization', 'is', 'a', '034', 'common', 'task', 'in', 'Natural', 'Language', 'Processing', 'NLP', 'It', 's', 'a', 'fundamental', 'step', 'in', 'both', 'traditional', 'NLP', 'methods', 'like', 'Count', 'Vectorizer', 'and', 'Advanced', 'Deep', 'Learning', 'based', 'architectures', 'like', 'Transformers']


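- The RegEx pattern `\w+` matches each run of one or more word characters (letters, digits, and the underscore); any other character, such as a space or a punctuation mark, ends the token. This is why 'It’s' becomes 'It' and 's', and 'Learning-based' becomes 'Learning' and 'based'.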
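To make the behaviour of `\w+` concrete, the sketch below compares it with a slightly richer pattern that keeps hyphenated words and contractions together. The second pattern is only one possible choice for illustration, not a standard, and the sample string is made up.

```python
import re

sample = "It’s a state-of-the-art model_v2, isn't it?"

# \w matches letters, digits, and the underscore, so apostrophes
# and hyphens break words apart.
plain = re.findall(r"\w+", sample)

# Allow runs of word characters joined by a hyphen or an apostrophe
# (straight or curly) to stay together as one token.
joined = re.findall(r"\w+(?:[-'’]\w+)*", sample)

print(plain)
print(joined)
```

Which behaviour is preferable depends on the task: a search index may want 'state', 'of', 'the', 'art' as separate tokens, while a model vocabulary may want the compound kept whole.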

In [4]:
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tokenization won't be foolproof with split() method."""

tokens_sent = re.compile(r'[.!?] ').split(text) # The character class [.!?] lets one pattern match any of the three separators
tokens_sent

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time',
 "So sentence tokenization won't be foolproof with split() method."]
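The split above drops the punctuation mark that ended each sentence. If the marks should be kept, one option is a lookbehind, which matches the whitespace after a sentence ender without consuming the ender itself. A small sketch with a made-up sample string:

```python
import re

text = "Periods end sentences. So do exclamation points! What about questions? Those too."

# (?<=[.!?]) asserts that the previous character is a sentence ender;
# only the whitespace that follows it is consumed by the split.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```

A pattern like this still splits incorrectly after abbreviations such as "Dr. Smith", which is one reason to reach for a library tokenizer.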

# NLTK Tokenization 

### Word Tokenization

**NLTK Word Tokenizer:**

   - **Objective:** NLTK (Natural Language Toolkit) provides a word tokenizer that breaks text into words and keeps punctuation marks as separate tokens.
   
   - **Tokenization Criteria:** It splits on whitespace and separates most punctuation from words, while keeping some forms together (for example, 'shivan.K' stays one token) and splitting contractions such as "Don't" into 'Do' and "n't".
   
   - **Usage:** It is commonly used when you want to analyze the frequency of words, create a bag-of-words model, or perform other word-centric analyses.
   
   - **Example:** For the sentence "Hello, how are you today?", the NLTK word tokenizer would return the tokens: ["Hello", ",", "how", "are", "you", "today", "?"].



- Natural Language Toolkit (NLTK) is an open-source Python library for natural language processing.

- NLTK provides the functions word_tokenize() for word tokenization and sent_tokenize() for sentence tokenization.

In [5]:
# NLTK Installation
# A leading "!" tells the notebook to run the line as a command-line command
# !pip install --user -U nltk
# nltk has been installed using requirements.txt file

In [6]:
from nltk.tokenize import word_tokenize

text = """Tokenization is breaking the raw text into small chunks. 
            Tokenization breaks the raw text into words, sentences called tokens. 
            These tokens help in understanding the context or developing the model 
            for the NLP. The tokenization helps in interpreting the meaning of 
            the text by analyzing the sequence of the words. #Hope #shivan.K"""

tokens = word_tokenize(text)
print(tokens)

['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks', '.', 'Tokenization', 'breaks', 'the', 'raw', 'text', 'into', 'words', ',', 'sentences', 'called', 'tokens', '.', 'These', 'tokens', 'help', 'in', 'understanding', 'the', 'context', 'or', 'developing', 'the', 'model', 'for', 'the', 'NLP', '.', 'The', 'tokenization', 'helps', 'in', 'interpreting', 'the', 'meaning', 'of', 'the', 'text', 'by', 'analyzing', 'the', 'sequence', 'of', 'the', 'words', '.', '#', 'Hope', '#', 'shivan.K']


*Note*:
- Notice that NLTK word tokenization also treats punctuation marks as tokens.

### Punctuation-based tokenizer
A punctuation-based tokenizer and a word tokenizer serve different purposes in natural language processing (NLP) and text processing. Let's discuss the key differences:

**Punctuation-Based Tokenizer:**
   
   - **Objective:** A punctuation-based tokenizer splits text into tokens based on the presence of punctuation marks.
   
   - **Tokenization Criteria:** It uses punctuation marks (like periods, commas, exclamation marks, etc.) as the primary criteria for breaking text into tokens.
   
   - **Usage:** It is useful when you want to analyze text in terms of sentences or other units separated by punctuation.
   
   - **Example:** If you have a sentence like "Hello, how are you today?", a punctuation-based tokenizer would split it into tokens: ["Hello", ",", "how", "are", "you", "today", "?"].


In [7]:
# Punctuation-based tokenizer
"""
This tokenizer splits the sentences into words based on whitespaces and 
punctuations.
"""
from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize(text))


['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks', '.', 'Tokenization', 'breaks', 'the', 'raw', 'text', 'into', 'words', ',', 'sentences', 'called', 'tokens', '.', 'These', 'tokens', 'help', 'in', 'understanding', 'the', 'context', 'or', 'developing', 'the', 'model', 'for', 'the', 'NLP', '.', 'The', 'tokenization', 'helps', 'in', 'interpreting', 'the', 'meaning', 'of', 'the', 'text', 'by', 'analyzing', 'the', 'sequence', 'of', 'the', 'words', '.', '#', 'Hope', '#', 'shivan', '.', 'K']
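The behaviour shown above can be approximated with a single regular expression: runs of word characters, or runs of characters that are neither word characters nor whitespace. The sketch uses only the standard `re` module, so it runs without NLTK; it is meant to illustrate the rule, not to replace `wordpunct_tokenize`, and the helper name `punct_tokenize` is hypothetical.

```python
import re

def punct_tokenize(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return re.findall(r"\w+|[^\w\s]+", text)

hello_tokens = punct_tokenize("Hello, how are you today?")
tag_tokens = punct_tokenize("#Hope #shivan.K")

print(hello_tokens)
print(tag_tokens)
```

Note how 'shivan.K' is broken into three tokens here, whereas `word_tokenize` kept it as one.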


### Sentence Tokenization

In [8]:
from nltk.tokenize import sent_tokenize

text = """Tokenization is breaking the raw text into small chunks. 
            Tokenization breaks the raw text into words, sentences called tokens. 
            These tokens help in understanding the context or developing the model 
            for the NLP. The tokenization helps in interpreting the meaning of 
            the text by analyzing the sequence of the words."""

sent_tokenize(text)

['Tokenization is breaking the raw text into small chunks.',
 'Tokenization breaks the raw text into words, sentences called tokens.',
 'These tokens help in understanding the context or developing the model \n            for the NLP.',
 'The tokenization helps in interpreting the meaning of \n            the text by analyzing the sequence of the words.']

*Note*:
- \n is a newline character. sent_tokenize keeps the line breaks and indentation that were inside the original string.
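If the embedded newlines and indentation are not wanted, whitespace can be collapsed after (or before) sentence tokenization. A minimal sketch using only string methods; the `sentences` list here is copied by hand from the output above rather than produced by NLTK.

```python
# Two sentences as returned above, with the line break and indentation kept
sentences = [
    'These tokens help in understanding the context or developing the model \n            for the NLP.',
    'The tokenization helps in interpreting the meaning of \n            the text by analyzing the sequence of the words.',
]

# str.split() with no argument splits on any run of whitespace,
# so joining with a single space collapses newlines and indentation.
cleaned = [" ".join(sentence.split()) for sentence in sentences]

for sentence in cleaned:
    print(sentence)
```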

In [9]:
from nltk.tokenize import sent_tokenize
text = """Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words, sentences called tokens. These tokens help in understanding the context or developing the model for the NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words."""
sent_tokenize(text)

['Tokenization is breaking the raw text into small chunks.',
 'Tokenization breaks the raw text into words, sentences called tokens.',
 'These tokens help in understanding the context or developing the model for the NLP.',
 'The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words.']

### Tweet tokenizer

When we tokenize text such as tweets, the tokenizers mentioned above often fail to produce practical tokens. To address this, NLTK provides a rule-based tokenizer designed for tweets. It splits emojis into separate tokens, which is useful for tasks like sentiment analysis.

In [40]:
from nltk.tokenize import sent_tokenize

text = "Don't take cryptocurrency advice from people on Twitter😅🤭"

sent_tokenize(text)

["Don't take cryptocurrency advice from people on Twitter😅🤭"]

In [10]:
from nltk.tokenize import TweetTokenizer

tweet = "Don't take cryptocurrency advice from people on Twitter😅🤭"

tokenizer = TweetTokenizer()

print(tokenizer.tokenize(tweet))

["Don't", 'take', 'cryptocurrency', 'advice', 'from', 'people', 'on', 'Twitter', '😅', '🤭']


In [11]:
from nltk.tokenize import word_tokenize

tweet = "Don't take cryptocurrency advice from people on Twitter😅🤭"


word_tokenize(tweet)

['Do',
 "n't",
 'take',
 'cryptocurrency',
 'advice',
 'from',
 'people',
 'on',
 'Twitter😅🤭']
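To give a feel for what "rule based" means here, the sketch below tokenizes the same tweet with one hand-written regular expression. It is a much-simplified illustration and not the pattern `TweetTokenizer` uses internally; the names `TWEET_PATTERN` and `simple_tweet_tokenize` are hypothetical.

```python
import re

TWEET_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs
    | [@#]\w+           # @mentions and #hashtags
    | \w+(?:'\w+)*      # words, keeping contractions such as Don't
    | [^\w\s]           # any other single symbol, e.g. an emoji
    """,
    re.VERBOSE,
)

def simple_tweet_tokenize(text):
    """Tokenize a tweet with a small set of hand-written rules."""
    return TWEET_PATTERN.findall(text)

tweet = "Don't take cryptocurrency advice from people on Twitter😅🤭"
tweet_tokens = simple_tweet_tokenize(tweet)
print(tweet_tokens)
```

The order of the alternatives matters: URLs, mentions and hashtags are tried before plain words so that they are not broken apart.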

# Tokenization Using spaCy

- spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.

- In spaCy we create a language object (`nlp`), which is then used for word and sentence tokenization.


In [8]:
# !pip install spacy
# !python -m spacy download en_core_web_sm  # spaCy v3 needs the full model name; the bare "en" shortcut is no longer supported
# spacy has been installed using requirements.txt file

##### Word Tokenizer

In [10]:
# Create a blank English pipeline (tokenizer only, no trained model)
from spacy.lang.en import English

nlp = English()

text = """Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words, sentences called tokens. These tokens help in understanding the context or developing the model for the NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words."""

my_doc = nlp(text)

token_list = []
for token in my_doc:
    token_list.append(token.text)

print(token_list)

['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks', '.', 'Tokenization', 'breaks', 'the', 'raw', 'text', 'into', 'words', ',', 'sentences', 'called', 'tokens', '.', 'These', 'tokens', 'help', 'in', 'understanding', 'the', 'context', 'or', 'developing', 'the', 'model', 'for', 'the', 'NLP', '.', 'The', 'tokenization', 'helps', 'in', 'interpreting', 'the', 'meaning', 'of', 'the', 'text', 'by', 'analyzing', 'the', 'sequence', 'of', 'the', 'words', '.']


In [13]:
# !python -m spacy download en_core_web_md 
import spacy

# Download the spaCy model if not already downloaded
try:
    nlp = spacy.load("en_core_web_md")
except OSError:
    spacy.cli.download("en_core_web_md")
    nlp = spacy.load("en_core_web_md")

doc = nlp("We are learning NLP using spaCy")

print([token.text for token in doc])


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
['We', 'are', 'learning', 'NLP', 'using', 'spaCy']


In [14]:
import spacy

nlp = spacy.load("en_core_web_md")

text = """Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words, sentences called tokens. These tokens help in understanding the context or developing the model for the NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words."""

doc = nlp(text)

print([token.text for token in doc])


['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks', '.', 'Tokenization', 'breaks', 'the', 'raw', 'text', 'into', 'words', ',', 'sentences', 'called', 'tokens', '.', 'These', 'tokens', 'help', 'in', 'understanding', 'the', 'context', 'or', 'developing', 'the', 'model', 'for', 'the', 'NLP', '.', 'The', 'tokenization', 'helps', 'in', 'interpreting', 'the', 'meaning', 'of', 'the', 'text', 'by', 'analyzing', 'the', 'sequence', 'of', 'the', 'words', '.']


##### Sentence Tokenizer

In [15]:
import spacy

nlp = spacy.load("en_core_web_md")

text = """Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words, sentences called tokens. These tokens help in understanding the context or developing the model for the NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words."""

doc = nlp(text)

for sent in doc.sents:
    print(sent.text)

Tokenization is breaking the raw text into small chunks.
Tokenization breaks the raw text into words, sentences called tokens.
These tokens help in understanding the context or developing the model for the NLP.
The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words.


In [16]:
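# Create a blank English pipeline (tokenizer only; no tagger, parser, NER or word vectors)
from spacy.lang.en import English

nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
# (spaCy v3 takes the component name as a string, so the older
# nlp.create_pipe('sentencizer') step is not needed)
nlp.add_pipe('sentencizer')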

text = """Tokenization is breaking the raw text into small chunks. Tokenization breaks the raw text into words, sentences called tokens. These tokens help in understanding the context or developing the model for the NLP. The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words."""

# nlp object is used to create documents with linguistic annotations
doc = nlp(text)

# Create list of sentence tokens

sentence_list =[]
for sentence in doc.sents:
    sentence_list.append(sentence.text)
print(sentence_list)

['Tokenization is breaking the raw text into small chunks.', 'Tokenization breaks the raw text into words, sentences called tokens.', 'These tokens help in understanding the context or developing the model for the NLP.', 'The tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words.']


# Assignment 1: Which Tokenization Should we use?

# Happy Learning -> Success Analytics