### Tokenization

**Tokenization** in NLP is the process of splitting text into smaller units called *tokens*. Depending on the granularity, a token can be a sentence, a word, a subword, or a single character.

- **Word Tokenization**: Splitting text into words (e.g., "I love programming" → `["I", "love", "programming"]`).
- **Sentence Tokenization**: Splitting text into sentences (e.g., "I love programming. It’s fun." → `["I love programming.", "It’s fun."]`).
- **Character Tokenization**: Splitting text into individual characters (e.g., "hello" → `["h", "e", "l", "l", "o"]`).
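- **Subword Tokenization**: Splitting words into smaller units such as prefixes, stems, and suffixes (e.g., "unhappiness" → `["un", "happiness"]`). The exact pieces depend on the vocabulary the tokenizer was built with.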

Tokenization is usually the first preprocessing step in an NLP pipeline: downstream tasks such as text classification, sentiment analysis, and machine translation operate on tokens rather than on raw strings.
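
The four granularities listed above can be sketched with the standard library alone. This is a rough illustration rather than how NLTK works: the regular expressions are simple heuristics, and `subword_tokenize` and `toy_vocab` are hypothetical names introduced here, with a hand-picked vocabulary instead of a learned one.

```python
import re

text = "I love programming. It's fun."

# Word tokenization: runs of word characters (apostrophes kept), or one punctuation mark.
word_tokens = re.findall(r"[\w']+|[^\w\s]", text)

# Sentence tokenization: split on whitespace that follows '.', '!' or '?'.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character tokenization: every character becomes a token.
char_tokens = list("hello")


def subword_tokenize(word, vocab):
    """Greedy longest-match split of `word` against `vocab` (hypothetical helper)."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens


toy_vocab = {"un", "happiness"}  # assumed vocabulary, for illustration only

print(word_tokens)      # ['I', 'love', 'programming', '.', "It's", 'fun', '.']
print(sentence_tokens)  # ['I love programming.', "It's fun."]
print(char_tokens)      # ['h', 'e', 'l', 'l', 'o']
print(subword_tokenize("unhappiness", toy_vocab))  # ['un', 'happiness']
```

With a different vocabulary, for example `{"un", "happi", "ness"}`, the same word splits into `['un', 'happi', 'ness']`, which is why subword output always depends on the vocabulary in use.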

In [16]:
import nltk
# One-time download of the Punkt sentence-tokenizer data used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\sayan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [4]:
corpus = 'Hello , My name is Sayan De. I am Data Science enthusiast.I build Machine Learning Projects focused on Predictive analytics. Right now I am trying to improve my conceptual understanding for NLP and Deep Learning.'

In [8]:
print(corpus)

Hello , My name is Sayan De. I am Data Science enthusiast.I build Machine Learning Projects focused on Predictive analytics. Right now I am trying to improve my conceptual understanding for NLP and Deep Learning.


In [20]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [26]:
sentences = sent_tokenize(corpus)  # list of sentence strings

In [28]:
sentences

['Hello , My name is Sayan De.',
 'I am Data Science enthusiast.I build Machine Learning Projects focused on Predictive analytics.',
 'Right now I am trying to improve my conceptual understanding for NLP and Deep Learning.']
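
Note that the second element of `sentences` still holds two sentences. The corpus has no space after `enthusiast.`, and the sentence tokenizer apparently does not treat a period inside an unbroken token as a boundary. One possible workaround is to repair the spacing before tokenizing. The sketch below is a heuristic, not part of NLTK: `add_missing_sentence_spaces` is a hypothetical helper, and it would also alter strings such as `file.Name`, so it should be applied with care.

```python
import re


def add_missing_sentence_spaces(text):
    """Insert a space after '.', '!' or '?' when it sits directly between
    a lowercase letter and an uppercase letter (heuristic, hypothetical helper)."""
    return re.sub(r"(?<=[a-z])([.!?])(?=[A-Z])", r"\1 ", text)


sample = "I am Data Science enthusiast.I build Machine Learning Projects."
fixed = add_missing_sentence_spaces(sample)
print(fixed)  # I am Data Science enthusiast. I build Machine Learning Projects.
```

Passing the repaired text to `sent_tokenize` instead of the raw corpus should then give the tokenizer a whitespace boundary to work with.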

In [22]:
words = word_tokenize(corpus)  # list of word and punctuation tokens

In [24]:
words

['Hello',
 ',',
 'My',
 'name',
 'is',
 'Sayan',
 'De',
 '.',
 'I',
 'am',
 'Data',
 'Science',
 'enthusiast.I',
 'build',
 'Machine',
 'Learning',
 'Projects',
 'focused',
 'on',
 'Predictive',
 'analytics',
 '.',
 'Right',
 'now',
 'I',
 'am',
 'trying',
 'to',
 'improve',
 'my',
 'conceptual',
 'understanding',
 'for',
 'NLP',
 'and',
 'Deep',
 'Learning',
 '.']

In [42]:
from nltk.tokenize import TreebankWordTokenizer

In [46]:
tok = TreebankWordTokenizer()

In [48]:
tok.tokenize(corpus)

['Hello',
 ',',
 'My',
 'name',
 'is',
 'Sayan',
 'De.',
 'I',
 'am',
 'Data',
 'Science',
 'enthusiast.I',
 'build',
 'Machine',
 'Learning',
 'Projects',
 'focused',
 'on',
 'Predictive',
 'analytics.',
 'Right',
 'now',
 'I',
 'am',
 'trying',
 'to',
 'improve',
 'my',
 'conceptual',
 'understanding',
 'for',
 'NLP',
 'and',
 'Deep',
 'Learning',
 '.']
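
The two results are not identical: `word_tokenize` returned `'De', '.'` and `'analytics', '.'`, while `TreebankWordTokenizer` returned `'De.'` and `'analytics.'`. The likely reason is that `word_tokenize` first splits the text into sentences and tokenizes each one, whereas `TreebankWordTokenizer` assumes its input is already a single sentence and detaches only the final period. To locate such disagreements in longer outputs, a small standard-library sketch can help. `token_differences` is a hypothetical helper, and the two lists are the first tokens copied from the outputs above.

```python
from difflib import SequenceMatcher

# Leading tokens of each result, copied from the notebook outputs.
word_tokenize_out = ['Hello', ',', 'My', 'name', 'is', 'Sayan', 'De', '.', 'I', 'am']
treebank_out = ['Hello', ',', 'My', 'name', 'is', 'Sayan', 'De.', 'I', 'am']


def token_differences(a, b):
    """Return (tokens_from_a, tokens_from_b) for every span where the lists disagree."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(token_differences(word_tokenize_out, treebank_out))
# [(['De', '.'], ['De.'])]
```

Running the Treebank tokenizer on each sentence from `sent_tokenize` separately, rather than on the whole corpus, is the usual way to bring the two outputs closer together.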

In [50]:
type(tok.tokenize(corpus))

list