## <span style="color: #B66A50;">Tokenization: A Foundational Step in NLP</span>

Tokenization is a basic but crucial step in Natural Language Processing (NLP): it breaks a body of text into smaller units (tokens), such as sentences or individual words. This notebook performs it with **NLTK (Natural Language Toolkit)**, a popular Python NLP library.


In [20]:
%pip uninstall nltk -y

Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


In [21]:
%pip install nltk

Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


In [3]:
corpus = """ My name is sourav kumar. 
This is the notebook to understand Basics of Tokenization.
that help me to maintain NLP repo.
"""

In [4]:
print(corpus)

 My name is sourav kumar. 
This is the notebook to understand Basics of Tokenization.
that help me to maintain NLP repo.



In [1]:
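# Import NLTK and fetch the Punkt sentence-tokenizer data used by sent_tokenize.
# The reload guards against a partially initialized module after NLTK was
# reinstalled in the same kernel session.
import importlib
import nltk
nltk = importlib.reload(nltk)
nltk.download('punkt')
# Recent NLTK releases (including 3.9.x) load 'punkt_tab' instead; if
# sent_tokenize raises a LookupError, also run nltk.download('punkt_tab').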

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
# Tokenize the corpus into sentences.
from nltk.tokenize import sent_tokenize
document = sent_tokenize(corpus)
type(document)

list

In [6]:
for sentence in document:
    print(sentence)

 My name is sourav kumar.
This is the notebook to understand Basics of Tokenization.
that help me to maintain NLP repo.
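
To see why a trained sentence tokenizer is worth using, it can help to compare it with a naive rule. The sketch below is not how `sent_tokenize` works; it is a hypothetical helper (`naive_sent_split`, a name chosen here for illustration) that splits wherever sentence-ending punctuation is followed by whitespace. It handles simple text like the corpus above, but breaks on abbreviations such as "Dr.", which is the kind of case the Punkt model is designed to cope with.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Works for plain sentences.
print(naive_sent_split("My name is sourav kumar. This is the notebook."))
# ['My name is sourav kumar.', 'This is the notebook.']

# Fails on abbreviations: "Dr." is wrongly treated as a sentence end.
print(naive_sent_split("Dr. Kumar maintains the repo. It covers NLP."))
# ['Dr.', 'Kumar maintains the repo.', 'It covers NLP.']
```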


In [8]:
# Tokenize the whole corpus into words and punctuation tokens.
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)
type(words)

list

In [9]:
for word in words:
    print(word)

My
name
is
sourav
kumar
.
This
is
the
notebook
to
understand
Basics
of
Tokenization
.
that
help
me
to
maintain
NLP
repo
.


In [10]:
# Tokenize each sentence separately, giving one token list per sentence.
from nltk.tokenize import word_tokenize
for sentence in document:
    print(word_tokenize(sentence))

['My', 'name', 'is', 'sourav', 'kumar', '.']
['This', 'is', 'the', 'notebook', 'to', 'understand', 'Basics', 'of', 'Tokenization', '.']
['that', 'help', 'me', 'to', 'maintain', 'NLP', 'repo', '.']


In [16]:
# Split the text into runs of word characters and runs of punctuation.
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)


['My',
 'name',
 'is',
 'sourav',
 'kumar',
 '.',
 'This',
 'is',
 'the',
 'notebook',
 'to',
 'understand',
 'Basics',
 'of',
 'Tokenization',
 '.',
 'that',
 'help',
 'me',
 'to',
 'maintain',
 'NLP',
 'repo',
 '.']
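
The behaviour above can be approximated with a single regular expression. The sketch below uses only the standard library and assumes the pattern `\w+|[^\w\s]+`, which is the one NLTK documents for `WordPunctTokenizer`; the function name `wordpunct_sketch` is made up for this illustration. It also shows a side effect that the corpus does not exercise: contractions are cut at the apostrophe.

```python
import re

# Assumption: same pattern as documented for NLTK's WordPunctTokenizer.
# \w+       matches a run of letters, digits, or underscores
# [^\w\s]+  matches a run of characters that are neither word chars nor whitespace
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Return all word runs and punctuation runs, in order."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_sketch("My name is sourav kumar."))
# ['My', 'name', 'is', 'sourav', 'kumar', '.']

print(wordpunct_sketch("Don't stop!"))
# ['Don', "'", 't', 'stop', '!']
```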

In [18]:
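# Penn Treebank-style tokenizer. It expects one sentence at a time,
# so on multi-sentence text only the final period is split off.
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)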

['My',
 'name',
 'is',
 'sourav',
 'kumar.',
 'This',
 'is',
 'the',
 'notebook',
 'to',
 'understand',
 'Basics',
 'of',
 'Tokenization.',
 'that',
 'help',
 'me',
 'to',
 'maintain',
 'NLP',
 'repo',
 '.']
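
The corpus contains no contractions, so the output above does not show how Treebank-style tokenization treats them. The following is a deliberately simplified sketch of that idea, not NLTK's actual rule set: `split_contractions` is a hypothetical helper that separates the negative clitic `n't` and a handful of common clitics such as `'s` and `'ll`, then splits on whitespace. It ignores punctuation entirely.

```python
import re

def split_contractions(text):
    """Simplified, hypothetical approximation of Treebank-style clitic splitting."""
    # Separate the negative clitic: "Don't" -> "Do n't", "can't" -> "ca n't".
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Separate a few common clitics: "it's" -> "it 's", "we'll" -> "we 'll".
    text = re.sub(r"(?i)(\w)('(?:s|m|d|ll|re|ve))\b", r"\1 \2", text)
    return text.split()

print(split_contractions("Don't worry because it's fine"))
# ['Do', "n't", 'worry', 'because', 'it', "'s", 'fine']
```

Keeping `n't` and `'s` as separate tokens lets later steps treat negation and possessive or verb clitics as units of their own, rather than as part of the preceding word.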

In [19]:
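# gaps=True means the pattern describes the separators (here, whitespace)
# rather than the tokens, so punctuation stays attached to the words.
from nltk.tokenize import regexp_tokenize
regexp_tokenize(corpus, pattern=r'\s+', gaps=True)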

['My',
 'name',
 'is',
 'sourav',
 'kumar.',
 'This',
 'is',
 'the',
 'notebook',
 'to',
 'understand',
 'Basics',
 'of',
 'Tokenization.',
 'that',
 'help',
 'me',
 'to',
 'maintain',
 'NLP',
 'repo.']
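
The role of `gaps` is easier to see next to its plain-`re` counterparts. In this sketch, which uses only the standard library, `split_on_gaps` and `match_tokens` are illustrative names rather than NLTK functions: the first treats the pattern as a separator and drops empty strings (roughly what `gaps=True` does above), while the second treats the pattern as a description of the tokens themselves.

```python
import re

def split_on_gaps(text, pattern=r"\s+"):
    """Pattern matches the separators; keep the non-empty pieces in between."""
    return [piece for piece in re.split(pattern, text) if piece]

def match_tokens(text, pattern=r"\w+"):
    """Pattern matches the tokens themselves."""
    return re.findall(pattern, text)

sample = " My name is sourav kumar. \nThis is NLP.\n"

print(split_on_gaps(sample))
# ['My', 'name', 'is', 'sourav', 'kumar.', 'This', 'is', 'NLP.']

print(match_tokens(sample))
# ['My', 'name', 'is', 'sourav', 'kumar', 'This', 'is', 'NLP']
```

Splitting on whitespace keeps `'kumar.'` as one token, whereas matching `\w+` drops the punctuation altogether; which one is appropriate depends on whether punctuation matters downstream.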

### Summary of Tokenization Use Cases

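1. `sent_tokenize`: splits a paragraph into sentences using the pretrained Punkt model.
2. `word_tokenize`: splits text into word and punctuation tokens; this is NLTK's recommended general-purpose word tokenizer.
3. `wordpunct_tokenize`: splits text into runs of word characters and runs of punctuation, so punctuation is always separated from words.
4. `TreebankWordTokenizer`: applies Penn Treebank conventions, such as splitting contractions. It expects one sentence at a time, which is why `'kumar.'` and `'Tokenization.'` stay intact in the output above.
5. `regexp_tokenize`: tokenizes text with a custom regular expression; with `gaps=True` the pattern matches the separators rather than the tokens.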
