# Tokenization #

### Common Terminology ###

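1. A corpus is the whole body of text being analyzed; in this notebook it is a single paragraph
2. A document is one unit of the corpus; here, each sentence is treated as a document
3. The vocabulary is the set of all unique words in the corpus
4. Words are all the word tokens in the corpus, repeats included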
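
To make the four terms above concrete, here is a minimal sketch on a toy text using plain Python only. The toy corpus, the naive split on full stops, and the lowercasing are all assumptions made for illustration; they are not how NLTK tokenizes text.

```python
# Toy corpus: a short paragraph made of two sentences (illustrative only).
corpus = "Cats chase mice. Mice fear cats."

# Documents: one per sentence, using a naive split on full stops.
documents = [s.strip() for s in corpus.split(".") if s.strip()]

# Words: every word token, repeats included (lowercased so "Mice" == "mice").
words = [w.lower() for doc in documents for w in doc.split()]

# Vocabulary: the unique words only.
vocabulary = sorted(set(words))

print(documents)   # ['Cats chase mice', 'Mice fear cats']
print(len(words))  # 6
print(vocabulary)  # ['cats', 'chase', 'fear', 'mice']
```

The corpus has six words but a vocabulary of only four, because `cats` and `mice` each appear twice.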

#### What is Tokenization?

Tokenization is the process of splitting text, such as a paragraph or a sentence, into smaller units called tokens. A paragraph can be tokenized into sentences, and a paragraph or sentence can be tokenized into words.

In [1]:
!pip install nltk



In [2]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\vik43\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
## Converting a paragraph to sentences
from nltk.tokenize import sent_tokenize

In [4]:
corpus=""" The world around us is constantly evolving, driven by advancements in technology and the endless pursuit of knowledge. 
As we adapt to new challenges and opportunities, it is important to remain open-minded and embrace change. 
The key to thriving in this dynamic environment is not only to stay informed but also to foster creativity and innovation. 
By working together, sharing ideas, and supporting each other, we can build a future that is not only sustainable but also enriching for all. 
Every small step we take today paves the way for a brighter tomorrow.
"""

In [5]:
sent_tokenize(corpus)

[' The world around us is constantly evolving, driven by advancements in technology and the endless pursuit of knowledge.',
 'As we adapt to new challenges and opportunities, it is important to remain open-minded and embrace change.',
 'The key to thriving in this dynamic environment is not only to stay informed but also to foster creativity and innovation.',
 'By working together, sharing ideas, and supporting each other, we can build a future that is not only sustainable but also enriching for all.',
 'Every small step we take today paves the way for a brighter tomorrow.']

***Here, the paragraph is converted into sentences, returned as elements in a list***

In [6]:
## Converting a paragraph to words
from nltk.tokenize import word_tokenize

In [7]:
word_tokenize(corpus)

['The',
 'world',
 'around',
 'us',
 'is',
 'constantly',
 'evolving',
 ',',
 'driven',
 'by',
 'advancements',
 'in',
 'technology',
 'and',
 'the',
 'endless',
 'pursuit',
 'of',
 'knowledge',
 '.',
 'As',
 'we',
 'adapt',
 'to',
 'new',
 'challenges',
 'and',
 'opportunities',
 ',',
 'it',
 'is',
 'important',
 'to',
 'remain',
 'open-minded',
 'and',
 'embrace',
 'change',
 '.',
 'The',
 'key',
 'to',
 'thriving',
 'in',
 'this',
 'dynamic',
 'environment',
 'is',
 'not',
 'only',
 'to',
 'stay',
 'informed',
 'but',
 'also',
 'to',
 'foster',
 'creativity',
 'and',
 'innovation',
 '.',
 'By',
 'working',
 'together',
 ',',
 'sharing',
 'ideas',
 ',',
 'and',
 'supporting',
 'each',
 'other',
 ',',
 'we',
 'can',
 'build',
 'a',
 'future',
 'that',
 'is',
 'not',
 'only',
 'sustainable',
 'but',
 'also',
 'enriching',
 'for',
 'all',
 '.',
 'Every',
 'small',
 'step',
 'we',
 'take',
 'today',
 'paves',
 'the',
 'way',
 'for',
 'a',
 'brighter',
 'tomorrow',
 '.']

In [8]:
from nltk.tokenize import wordpunct_tokenize

In [9]:
wordpunct_tokenize(corpus)

['The',
 'world',
 'around',
 'us',
 'is',
 'constantly',
 'evolving',
 ',',
 'driven',
 'by',
 'advancements',
 'in',
 'technology',
 'and',
 'the',
 'endless',
 'pursuit',
 'of',
 'knowledge',
 '.',
 'As',
 'we',
 'adapt',
 'to',
 'new',
 'challenges',
 'and',
 'opportunities',
 ',',
 'it',
 'is',
 'important',
 'to',
 'remain',
 'open',
 '-',
 'minded',
 'and',
 'embrace',
 'change',
 '.',
 'The',
 'key',
 'to',
 'thriving',
 'in',
 'this',
 'dynamic',
 'environment',
 'is',
 'not',
 'only',
 'to',
 'stay',
 'informed',
 'but',
 'also',
 'to',
 'foster',
 'creativity',
 'and',
 'innovation',
 '.',
 'By',
 'working',
 'together',
 ',',
 'sharing',
 'ideas',
 ',',
 'and',
 'supporting',
 'each',
 'other',
 ',',
 'we',
 'can',
 'build',
 'a',
 'future',
 'that',
 'is',
 'not',
 'only',
 'sustainable',
 'but',
 'also',
 'enriching',
 'for',
 'all',
 '.',
 'Every',
 'small',
 'step',
 'we',
 'take',
 'today',
 'paves',
 'the',
 'way',
 'for',
 'a',
 'brighter',
 'tomorrow',
 '.']
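
Compared with the `word_tokenize` output, `wordpunct_tokenize` splits `'open-minded'` into `'open'`, `'-'`, `'minded'`, because it treats every run of punctuation as its own token. The sketch below reproduces that behaviour with a regular expression. The helper name `simple_wordpunct` is hypothetical, and the pattern is, to my understanding, the one NLTK uses for this tokenizer; treat it as an illustration rather than a drop-in replacement.

```python
import re

def simple_wordpunct(text):
    # Match either a run of word characters, or a run of characters
    # that are neither word characters nor whitespace (punctuation).
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_wordpunct("remain open-minded and embrace change.")
print(tokens)
# ['remain', 'open', '-', 'minded', 'and', 'embrace', 'change', '.']
```

Because the hyphen is neither a word character nor whitespace, it always becomes a separate token under this rule.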

In [10]:
from nltk.tokenize import TreebankWordTokenizer

In [11]:
tokenizer = TreebankWordTokenizer()

In [12]:
tokenizer.tokenize(corpus)

['The',
 'world',
 'around',
 'us',
 'is',
 'constantly',
 'evolving',
 ',',
 'driven',
 'by',
 'advancements',
 'in',
 'technology',
 'and',
 'the',
 'endless',
 'pursuit',
 'of',
 'knowledge.',
 'As',
 'we',
 'adapt',
 'to',
 'new',
 'challenges',
 'and',
 'opportunities',
 ',',
 'it',
 'is',
 'important',
 'to',
 'remain',
 'open-minded',
 'and',
 'embrace',
 'change.',
 'The',
 'key',
 'to',
 'thriving',
 'in',
 'this',
 'dynamic',
 'environment',
 'is',
 'not',
 'only',
 'to',
 'stay',
 'informed',
 'but',
 'also',
 'to',
 'foster',
 'creativity',
 'and',
 'innovation.',
 'By',
 'working',
 'together',
 ',',
 'sharing',
 'ideas',
 ',',
 'and',
 'supporting',
 'each',
 'other',
 ',',
 'we',
 'can',
 'build',
 'a',
 'future',
 'that',
 'is',
 'not',
 'only',
 'sustainable',
 'but',
 'also',
 'enriching',
 'for',
 'all.',
 'Every',
 'small',
 'step',
 'we',
 'take',
 'today',
 'paves',
 'the',
 'way',
 'for',
 'a',
 'brighter',
 'tomorrow',
 '.']

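***The difference from `word_tokenize` is in how full stops are handled. `TreebankWordTokenizer` leaves the full stop at the end of each inner sentence attached to the preceding word (`'knowledge.'`, `'change.'`, `'innovation.'`, `'all.'`); only the full stop at the very end of the text is separated into its own token (`'tomorrow'`, `'.'`). The tokenizer assumes its input has already been split into sentences. You can check this in the output above.***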
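
One way to get every sentence-final full stop separated is to give the tokenizer one sentence at a time. The sketch below compares the two approaches. The two sentences are shortened, hand-split examples (an assumption for brevity); in the notebook they could come from `sent_tokenize(corpus)` instead.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Hand-split sentences; in practice these could come from sent_tokenize(corpus).
sentences = [
    "It is important to embrace change.",
    "Every small step paves the way.",
]

# Whole text at once: only the last full stop becomes its own token,
# so the first sentence ends in the single token 'change.'.
whole = tokenizer.tokenize(" ".join(sentences))

# One sentence at a time: each sentence's full stop is separated.
per_sentence = [tokenizer.tokenize(s) for s in sentences]

print(whole)
print(per_sentence)
```

`TreebankWordTokenizer` is rule-based and does not need the `punkt_tab` download, so this block runs with only `nltk` installed.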