## Tokenization
- Tokenization is the process of converting a sequence of characters into a sequence of tokens. A token is a string of characters that acts as a unit of meaning, such as a word, a punctuation mark, or a special character.
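As a minimal sketch of this definition, the snippet below turns a short string into tokens in two ways using only Python's standard library. The sample sentence and the regular expression are illustrative assumptions and are not what NLTK uses internally.

```python
import re

text = "Hello, world! NLP is fun."  # hypothetical sample sentence

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# A simple pattern: runs of word characters, or single non-space symbols.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)  # ['Hello,', 'world!', 'NLP', 'is', 'fun.']
print(regex_tokens)       # ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
```

The second list is closer to what the tokenizers used later in this notebook produce, because punctuation marks become tokens of their own.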

In [1]:
corpus = """Tokenization is the process of converting a sequence of characters into a sequence of tokens. A token is a string of characters that represents a unit of meaning, such as a word, punctuation mark, or special character.
This process is essential in natural language processing (NLP) tasks, as it allows for the analysis and manipulation of text data.
There are several methods of tokenization, including word tokenization, sentence tokenization, and subword tokenization. Each method has its own advantages and disadvantages, and the choice of method depends on the specific NLP task at hand.
Tokenization is a crucial step in many NLP applications, such as text classification, sentiment analysis, and machine translation.
It is base of any NLP task!
"""

In [2]:
corpus

'Tokenization is the process of converting a sequence of characters into a sequence of tokens. A token is a string of characters that represents a unit of meaning, such as a word, punctuation mark, or special character.\nThis process is essential in natural language processing (NLP) tasks, as it allows for the analysis and manipulation of text data.\nThere are several methods of tokenization, including word tokenization, sentence tokenization, and subword tokenization. Each method has its own advantages and disadvantages, and the choice of method depends on the specific NLP task at hand.\nTokenization is a crucial step in many NLP applications, such as text classification, sentiment analysis, and machine translation.\nIt is base of any NLP task!\n'

In [3]:
import nltk
import ssl

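# Workaround for SSL certificate errors during download: fall back to an
# unverified HTTPS context. This disables certificate checks, so use it only
# on a trusted network.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Older Python versions lack this attribute and do not verify by default.
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Force a fresh download of the Punkt sentence tokenizer data
# (newer NLTK releases load it from 'punkt_tab').
nltk.download('punkt', force=True)
nltk.download('punkt_tab', force=True)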

[nltk_data] Downloading package punkt to /Users/dhruvsmac/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/dhruvsmac/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
## Paragraph to Sentence Tokenization
from nltk.tokenize import sent_tokenize

In [5]:
sentences = sent_tokenize(corpus)

> In NLP, each sentence (or other unit of text) in a corpus is often referred to as a document or a text.

In [6]:
sentences

['Tokenization is the process of converting a sequence of characters into a sequence of tokens.',
 'A token is a string of characters that represents a unit of meaning, such as a word, punctuation mark, or special character.',
 'This process is essential in natural language processing (NLP) tasks, as it allows for the analysis and manipulation of text data.',
 'There are several methods of tokenization, including word tokenization, sentence tokenization, and subword tokenization.',
 'Each method has its own advantages and disadvantages, and the choice of method depends on the specific NLP task at hand.',
 'Tokenization is a crucial step in many NLP applications, such as text classification, sentiment analysis, and machine translation.',
 'It is base of any NLP task!']
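To give a rough feel for what sentence tokenization involves, here is a deliberately naive splitter built on a regular expression. It is a stand-in written for illustration, not the Punkt model behind `sent_tokenize`: the helper `naive_sent_split` and the sample text are assumptions, and the rule simply breaks after every `.`, `!` or `?` that is followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = "Tokenization is a crucial step. It is base of any NLP task!\nDr. Smith agrees."
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

Note that the naive rule cuts "Dr. Smith agrees." into two pieces. Handling abbreviations like this is one reason a trained sentence tokenizer is usually preferred over a hand-written pattern.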

In [8]:
## Word Tokenization
from nltk.tokenize import word_tokenize

In [9]:
words = word_tokenize(corpus)

In [10]:
words

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'converting',
 'a',
 'sequence',
 'of',
 'characters',
 'into',
 'a',
 'sequence',
 'of',
 'tokens',
 '.',
 'A',
 'token',
 'is',
 'a',
 'string',
 'of',
 'characters',
 'that',
 'represents',
 'a',
 'unit',
 'of',
 'meaning',
 ',',
 'such',
 'as',
 'a',
 'word',
 ',',
 'punctuation',
 'mark',
 ',',
 'or',
 'special',
 'character',
 '.',
 'This',
 'process',
 'is',
 'essential',
 'in',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'tasks',
 ',',
 'as',
 'it',
 'allows',
 'for',
 'the',
 'analysis',
 'and',
 'manipulation',
 'of',
 'text',
 'data',
 '.',
 'There',
 'are',
 'several',
 'methods',
 'of',
 'tokenization',
 ',',
 'including',
 'word',
 'tokenization',
 ',',
 'sentence',
 'tokenization',
 ',',
 'and',
 'subword',
 'tokenization',
 '.',
 'Each',
 'method',
 'has',
 'its',
 'own',
 'advantages',
 'and',
 'disadvantages',
 ',',
 'and',
 'the',
 'choice',
 'of',
 'method',
 'depends',
 'on',
 'the',
 'specific',


In [11]:
# Word punctuation tokenizer
from nltk.tokenize import wordpunct_tokenize

> The word punctuation tokenizer splits text into runs of alphanumeric characters and runs of punctuation, discarding whitespace. Punctuation always becomes its own token (consecutive marks such as `?!` stay together as a single token), and contractions are split at the apostrophe. This can be useful for NLP tasks where punctuation carries meaning.
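NLTK documents its `WordPunctTokenizer` as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern directly with the standard `re` module, so the behaviour can be inspected without NLTK. The helper name `wordpunct_like` and the sample strings are assumptions made for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation and symbols).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_like(text):
    """Approximate wordpunct_tokenize with re.findall (hypothetical helper)."""
    return re.findall(WORDPUNCT_PATTERN, text)

print(wordpunct_like("How's it going?"))  # ['How', "'", 's', 'it', 'going', '?']
print(wordpunct_like("Wait... what?!"))   # ['Wait', '...', 'what', '?!']
```

The first call shows the contraction being split at the apostrophe; the second shows adjacent punctuation marks staying together in one token.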

In [12]:
words_punct = wordpunct_tokenize(corpus)

In [13]:
words_punct

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'converting',
 'a',
 'sequence',
 'of',
 'characters',
 'into',
 'a',
 'sequence',
 'of',
 'tokens',
 '.',
 'A',
 'token',
 'is',
 'a',
 'string',
 'of',
 'characters',
 'that',
 'represents',
 'a',
 'unit',
 'of',
 'meaning',
 ',',
 'such',
 'as',
 'a',
 'word',
 ',',
 'punctuation',
 'mark',
 ',',
 'or',
 'special',
 'character',
 '.',
 'This',
 'process',
 'is',
 'essential',
 'in',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'tasks',
 ',',
 'as',
 'it',
 'allows',
 'for',
 'the',
 'analysis',
 'and',
 'manipulation',
 'of',
 'text',
 'data',
 '.',
 'There',
 'are',
 'several',
 'methods',
 'of',
 'tokenization',
 ',',
 'including',
 'word',
 'tokenization',
 ',',
 'sentence',
 'tokenization',
 ',',
 'and',
 'subword',
 'tokenization',
 '.',
 'Each',
 'method',
 'has',
 'its',
 'own',
 'advantages',
 'and',
 'disadvantages',
 ',',
 'and',
 'the',
 'choice',
 'of',
 'method',
 'depends',
 'on',
 'the',
 'specific',


In [15]:
# To see the difference between word_tokenize and wordpunct_tokenize
corpus2 = "Hello! How's it going? This is an example: tokenization, NLP."
word_tokens = word_tokenize(corpus2)
wordpunct_tokens = wordpunct_tokenize(corpus2)

In [16]:
word_tokens

['Hello',
 '!',
 'How',
 "'s",
 'it',
 'going',
 '?',
 'This',
 'is',
 'an',
 'example',
 ':',
 'tokenization',
 ',',
 'NLP',
 '.']

In [17]:
wordpunct_tokens

['Hello',
 '!',
 'How',
 "'",
 's',
 'it',
 'going',
 '?',
 'This',
 'is',
 'an',
 'example',
 ':',
 'tokenization',
 ',',
 'NLP',
 '.']

In [18]:
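# Compare both tokenizers position by position.
# Note: zip() pairs tokens by index, so the rows shift after the first
# difference ("'s" vs "'" and "s") and stop at the end of the shorter list.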
for wt, wpt in zip(word_tokens, wordpunct_tokens):
    print(f"word_tokenize: {wt} \t|\t wordpunct_tokenize: {wpt}")

word_tokenize: Hello 	|	 wordpunct_tokenize: Hello
word_tokenize: ! 	|	 wordpunct_tokenize: !
word_tokenize: How 	|	 wordpunct_tokenize: How
word_tokenize: 's 	|	 wordpunct_tokenize: '
word_tokenize: it 	|	 wordpunct_tokenize: s
word_tokenize: going 	|	 wordpunct_tokenize: it
word_tokenize: ? 	|	 wordpunct_tokenize: going
word_tokenize: This 	|	 wordpunct_tokenize: ?
word_tokenize: is 	|	 wordpunct_tokenize: This
word_tokenize: an 	|	 wordpunct_tokenize: is
word_tokenize: example 	|	 wordpunct_tokenize: an
word_tokenize: : 	|	 wordpunct_tokenize: example
word_tokenize: tokenization 	|	 wordpunct_tokenize: :
word_tokenize: , 	|	 wordpunct_tokenize: tokenization
word_tokenize: NLP 	|	 wordpunct_tokenize: ,
word_tokenize: . 	|	 wordpunct_tokenize: NLP
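Because the two token lists have different lengths (16 and 17 tokens), the index-by-index pairing above drifts out of step after `How`, which makes every later row look like a mismatch. One way to isolate the real difference is to align the lists first. The sketch below uses `difflib.SequenceMatcher` from the standard library on the two token lists shown earlier, copied in as literals so the block runs on its own.

```python
from difflib import SequenceMatcher

word_tokens = ['Hello', '!', 'How', "'s", 'it', 'going', '?', 'This', 'is',
               'an', 'example', ':', 'tokenization', ',', 'NLP', '.']
wordpunct_tokens = ['Hello', '!', 'How', "'", 's', 'it', 'going', '?', 'This',
                    'is', 'an', 'example', ':', 'tokenization', ',', 'NLP', '.']

# Align the two sequences and keep only the spans that differ.
matcher = SequenceMatcher(a=word_tokens, b=wordpunct_tokens, autojunk=False)
differences = [
    (word_tokens[i1:i2], wordpunct_tokens[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for left, right in differences:
    print(f"word_tokenize: {left} | wordpunct_tokenize: {right}")
```

With this alignment, the only reported difference is the contraction: `word_tokenize` keeps `'s` as one token, while `wordpunct_tokenize` splits it into `'` and `s`.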


In [19]:
# Treebank Word Tokenizer
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()

In [21]:
treebank_toks = treebank_tokenizer.tokenize(corpus)
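NLTK documents `TreebankWordTokenizer` as assuming that its input has already been split into sentences, and it only detaches a period that ends the string. Passing the whole corpus in one call therefore leaves sentence-internal periods attached, as in `'tokens.'`. The sketch below imitates just that one rule with a regular expression. It is a simplified stand-in rather than the library's full rule set, and `split_final_period` and the sample sentences are hypothetical.

```python
import re

def split_final_period(text):
    """Detach a period only when it ends the string, then split on whitespace."""
    return re.sub(r"([^.])(\.)\s*$", r"\1 \2", text).split()

sentences = ["A token is a unit.", "It has meaning."]
whole_text = " ".join(sentences)

# One call on the joined text: only the last period is separated.
print(split_final_period(whole_text))
# ['A', 'token', 'is', 'a', 'unit.', 'It', 'has', 'meaning', '.']

# One call per sentence: every sentence-final period is separated.
print([split_final_period(s) for s in sentences])
# [['A', 'token', 'is', 'a', 'unit', '.'], ['It', 'has', 'meaning', '.']]
```

In practice this suggests running `sent_tokenize` first and applying the Treebank tokenizer to each sentence, which is roughly what `word_tokenize` does.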

In [22]:
treebank_toks

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'converting',
 'a',
 'sequence',
 'of',
 'characters',
 'into',
 'a',
 'sequence',
 'of',
 'tokens.',
 'A',
 'token',
 'is',
 'a',
 'string',
 'of',
 'characters',
 'that',
 'represents',
 'a',
 'unit',
 'of',
 'meaning',
 ',',
 'such',
 'as',
 'a',
 'word',
 ',',
 'punctuation',
 'mark',
 ',',
 'or',
 'special',
 'character.',
 'This',
 'process',
 'is',
 'essential',
 'in',
 'natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'tasks',
 ',',
 'as',
 'it',
 'allows',
 'for',
 'the',
 'analysis',
 'and',
 'manipulation',
 'of',
 'text',
 'data.',
 'There',
 'are',
 'several',
 'methods',
 'of',
 'tokenization',
 ',',
 'including',
 'word',
 'tokenization',
 ',',
 'sentence',
 'tokenization',
 ',',
 'and',
 'subword',
 'tokenization.',
 'Each',
 'method',
 'has',
 'its',
 'own',
 'advantages',
 'and',
 'disadvantages',
 ',',
 'and',
 'the',
 'choice',
 'of',
 'method',
 'depends',
 'on',
 'the',
 'specific',
 'NLP',
 'task',
 'a

In [25]:
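# Compare the Treebank tokenizer with wordpunct_tokenize position by position.
# Note: zip() pairs tokens by index, so the rows shift after the first difference.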
for tbt, wpt in zip(treebank_toks, words_punct):
    print(f"Treebank: {tbt} \t |\t wordpunct_tokenize: {wpt}")

Treebank: Tokenization 	 |	 wordpunct_tokenize: Tokenization
Treebank: is 	 |	 wordpunct_tokenize: is
Treebank: the 	 |	 wordpunct_tokenize: the
Treebank: process 	 |	 wordpunct_tokenize: process
Treebank: of 	 |	 wordpunct_tokenize: of
Treebank: converting 	 |	 wordpunct_tokenize: converting
Treebank: a 	 |	 wordpunct_tokenize: a
Treebank: sequence 	 |	 wordpunct_tokenize: sequence
Treebank: of 	 |	 wordpunct_tokenize: of
Treebank: characters 	 |	 wordpunct_tokenize: characters
Treebank: into 	 |	 wordpunct_tokenize: into
Treebank: a 	 |	 wordpunct_tokenize: a
Treebank: sequence 	 |	 wordpunct_tokenize: sequence
Treebank: of 	 |	 wordpunct_tokenize: of
Treebank: tokens. 	 |	 wordpunct_tokenize: tokens
Treebank: A 	 |	 wordpunct_tokenize: .
Treebank: token 	 |	 wordpunct_tokenize: A
Treebank: is 	 |	 wordpunct_tokenize: token
Treebank: a 	 |	 wordpunct_tokenize: is
Treebank: string 	 |	 wordpunct_tokenize: a
Treebank: of 	 |	 wordpunct_tokenize: string
Treebank: characters 	 |	 wordpun