### Tokenization 

In [1]:
text = "Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence called a token. Punctuation marks, words, and numbers can be considered tokens."

In [2]:
print(text)

Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence called a token. Punctuation marks, words, and numbers can be considered tokens.


In [3]:
text.split(" ")

['Tokenization',
 'is',
 'the',
 'process',
 'of',
 'breaking',
 'down',
 'the',
 'given',
 'text',
 'in',
 'natural',
 'language',
 'processing',
 'into',
 'the',
 'smallest',
 'unit',
 'in',
 'a',
 'sentence',
 'called',
 'a',
 'token.',
 'Punctuation',
 'marks,',
 'words,',
 'and',
 'numbers',
 'can',
 'be',
 'considered',
 'tokens.']
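
Splitting on spaces leaves punctuation attached to the neighbouring word ('token.', 'marks,'). As a rough sketch of one way around that, the snippet below uses only Python's standard re module. The pattern and the sample string are illustrative assumptions, not what any particular library does internally.

```python
import re

sample = "Punctuation marks, words, and numbers can be considered tokens."

# Plain whitespace split: punctuation stays glued to the word before it.
space_tokens = sample.split(" ")

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s] matches a single character that is neither of those nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(space_tokens)
print(regex_tokens)
```

This simple pattern would also split contractions and decimal numbers into several pieces, so it is only a starting point.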

### Sentence Tokenization 

Sentence tokenization cuts a block of text into smaller pieces, where each piece is a sentence.

Breaking text into sentences helps computers work with the text, much like breaking a big problem into smaller, more manageable parts. This way, each sentence can be analyzed and processed on its own.
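
To make the idea concrete, here is a deliberately naive sentence splitter built on Python's standard re module. The rule (split on whitespace after '.', '!' or '?') and the sample strings are assumptions chosen for illustration. The second example shows where such a rule breaks down.

```python
import re

def naive_sent_split(text):
    """Split on whitespace that follows '.', '!' or '?'.

    The lookbehind keeps the punctuation attached to its sentence.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Tokenization splits text into units. Each unit is a token! Is a number a token?"
sentences = naive_sent_split(sample)
print(sentences)

# Abbreviations fool the rule: the period after "Dr" is treated as a sentence end.
tricky = naive_sent_split("Dr. Smith arrived.")
print(tricky)
```

Trained sentence tokenizers, such as the Punkt models NLTK downloads, are designed to cope with abbreviations and similar cases that a fixed rule gets wrong.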

In [4]:
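import nltk
from nltk import word_tokenize, sent_tokenize

# Download the Punkt tokenizer models (skipped if already present).
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

# Split the text into a list of sentences.
sentence_tokens = sent_tokenize(text)
print(sentence_tokens)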

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

['Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence called a token.', 'Punctuation marks, words, and numbers can be considered tokens.']

### Word Tokenization

Word tokenization splits text into individual words and punctuation marks. Unlike splitting on spaces, word_tokenize separates punctuation into its own tokens, so 'token.' becomes 'token' and '.'.

In [6]:
word_tokens = word_tokenize(text)
print(word_tokens)

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'the', 'given', 'text', 'in', 'natural', 'language', 'processing', 'into', 'the', 'smallest', 'unit', 'in', 'a', 'sentence', 'called', 'a', 'token', '.', 'Punctuation', 'marks', ',', 'words', ',', 'and', 'numbers', 'can', 'be', 'considered', 'tokens', '.']


### Lemmatization

Lemmatization reduces a word to its lemma, the base form found in a dictionary. Unlike stemming, which simply chops off word endings and can produce non-words, lemmatization uses a vocabulary and the word's part of speech, so the result is a linguistically valid word.
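
The difference between chopping endings and looking up a base form can be sketched with a toy example. The suffix list and the lookup table below are made up for illustration and are not taken from NLTK or WordNet. Real stemmers and lemmatizers use far richer rules and data.

```python
# Hypothetical suffix list and lemma table, for illustration only.
SUFFIXES = ("ies", "es", "s", "ing", "ed")
LEMMA_TABLE = {"studies": "study", "mice": "mouse", "tokens": "token"}

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "mice", "tokens"]:
    print(w, "| stem:", crude_stem(w), "| lemma:", toy_lemma(w))
```

Here the suffix stripper turns "studies" into the non-word "stud" and leaves "mice" untouched, while the lookup returns "study" and "mouse".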

In [9]:
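import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Tokenizer models and the WordNet data used by the lemmatizer.
nltk.download('punkt')
nltk.download('wordnet')

words = word_tokenize(text)

lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is given,
# so plural nouns are reduced but verbs such as "is" stay unchanged.
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Joining with spaces leaves a space before each punctuation token.
lemmatized_text = ' '.join(lemmatized_words)

print(lemmatized_text)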

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...


Tokenization is the process of breaking down the given text in natural language processing into the smallest unit in a sentence called a token . Punctuation mark , word , and number can be considered token .


### Part-of-speech Tagging

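Part-of-speech (POS) tagging assigns a grammatical category (noun, verb, adjective, and so on) to each word in a sentence based on its syntactic role. It is a common step in natural language processing (NLP) for analyzing the structure and meaning of text. The tags in the output below follow the Penn Treebank tag set, for example NN (singular noun), NNS (plural noun), VBZ (third-person singular present verb), and DT (determiner).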

In [10]:
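from nltk import pos_tag

# Needs the tagger model, e.g. nltk.download('averaged_perceptron_tagger').
# Returns a list of (token, Penn Treebank tag) pairs.
pos_tag(word_tokens)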

[('Tokenization', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('process', 'NN'),
 ('of', 'IN'),
 ('breaking', 'VBG'),
 ('down', 'RP'),
 ('the', 'DT'),
 ('given', 'VBN'),
 ('text', 'NN'),
 ('in', 'IN'),
 ('natural', 'JJ'),
 ('language', 'NN'),
 ('processing', 'NN'),
 ('into', 'IN'),
 ('the', 'DT'),
 ('smallest', 'JJS'),
 ('unit', 'NN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('sentence', 'NN'),
 ('called', 'VBD'),
 ('a', 'DT'),
 ('token', 'NN'),
 ('.', '.'),
 ('Punctuation', 'NN'),
 ('marks', 'NNS'),
 (',', ','),
 ('words', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('numbers', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('considered', 'VBN'),
 ('tokens', 'NNS'),
 ('.', '.')]
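
One common use of these tags is to tell a lemmatizer which part of speech each word has. The helper below is a minimal sketch of that mapping. The function name penn_to_wordnet is hypothetical, and the sample pairs are copied from the output above. The single letters 'a', 'v', 'r' and 'n' are the part-of-speech codes WordNet uses for adjective, verb, adverb and noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs (RP is a particle, not an adverb)
        return "r"
    return "n"                # everything else is treated as a noun

# A few (token, tag) pairs taken from the tagger output above.
tagged = [("is", "VBZ"), ("smallest", "JJS"), ("tokens", "NNS"), ("down", "RP")]

wordnet_ready = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_ready)
```

Each letter can then be passed as the pos argument of WordNetLemmatizer.lemmatize, so that verbs and adjectives are looked up as such instead of as nouns.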