In [1]:
!pip install nltk # Install the Natural Language Toolkit (NLTK) library



In [2]:
import nltk # Import the NLTK library

NLTK keeps its corpora and models separate from the library itself, so a resource such as punkt_tab sometimes has to be downloaded manually.

In [3]:
nltk.download('punkt_tab')
# nltk.download() fetches a named dataset or model, here 'punkt_tab'.
# punkt_tab holds the data for the Punkt sentence tokenizer. sent_tokenize uses it directly, and word_tokenize needs it too because it splits text into sentences before splitting into words.
# Without it, both functions raise a LookupError.

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
txt="Hello Everyone. Campus is hoping you guys are doing well." # Define a sample text string
txt # Display the text string

'Hello Everyone. Campus is hoping you guys are doing well.'

In [5]:
txt.split('.') # Split the text into a list of strings based on the period '.' delimiter

['Hello Everyone', ' Campus is hoping you guys are doing well', '']

In [6]:
txt.split(' ') # Split the text into a list of strings based on the space ' ' delimiter

['Hello',
 'Everyone.',
 'Campus',
 'is',
 'hoping',
 'you',
 'guys',
 'are',
 'doing',
 'well.']

In [7]:
len(txt.split('.')) # Calculate the number of elements in the list obtained by splitting the text by '.'

3

In [8]:
len(txt.split(' ')) # Calculate the number of elements in the list obtained by splitting the text by space

10
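
The two `split` calls above show why plain `str.split` is only a rough tokenizer: splitting on `'.'` leaves an empty trailing string, and splitting on spaces leaves punctuation glued to words. A minimal standard-library sketch of a slightly better split follows. The regular expression is an illustrative assumption, not the rule set NLTK uses.

```python
import re

txt = "Hello Everyone. Campus is hoping you guys are doing well."

# Sentences: strip whitespace and drop the empty pieces left by split('.').
sentences = [part.strip() for part in txt.split('.') if part.strip()]

# Words: match runs of word characters, or any single non-space symbol,
# so punctuation becomes its own token instead of sticking to a word.
tokens = re.findall(r"\w+|[^\w\s]", txt)

print(sentences)
print(tokens)
```

This handles the sample text, but abbreviations such as "Dr." or decimals such as "3.14" would still be split incorrectly, which is the gap a trained tokenizer is meant to close.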

In [9]:
from nltk.tokenize import word_tokenize, sent_tokenize # Import word_tokenize and sent_tokenize functions from nltk.tokenize
# These functions tokenize text into words and sentences respectively.

In [10]:
word_tokenize(txt) # Tokenize the text into words using NLTK's word_tokenize function

['Hello',
 'Everyone',
 '.',
 'Campus',
 'is',
 'hoping',
 'you',
 'guys',
 'are',
 'doing',
 'well',
 '.']

In [11]:
for word in word_tokenize(txt): # Iterate through each word obtained from tokenizing the text
  print(word) # Print each word

Hello
Everyone
.
Campus
is
hoping
you
guys
are
doing
well
.


In [12]:
for word in word_tokenize(txt): # Iterate through each token obtained from tokenizing the text
  if word != '.': # Skip tokens that are just a period '.'
    print(word) # Print every other token

Hello
Everyone
Campus
is
hoping
you
guys
are
doing
well
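
The loop above drops only the `'.'` token. A slightly more general sketch, assuming the goal is to drop every token made up purely of ASCII punctuation, can use `string.punctuation`. The token list is copied from the `word_tokenize` output shown earlier so the snippet runs without NLTK.

```python
import string

# Tokens as produced by word_tokenize(txt) earlier in the notebook.
tokens = ['Hello', 'Everyone', '.', 'Campus', 'is', 'hoping',
          'you', 'guys', 'are', 'doing', 'well', '.']

# Keep a token unless every one of its characters is punctuation.
words = [tok for tok in tokens
         if not all(ch in string.punctuation for ch in tok)]

print(words)
```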


In [13]:
sent_tokenize(txt) # Tokenize the text into sentences using NLTK's sent_tokenize function

['Hello Everyone.', 'Campus is hoping you guys are doing well.']

# Stemming and Lemmatization
Techniques in NLP for reducing the vocabulary size (the number of unique words in the whole text) by mapping different forms of a word to a common root.
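
To make "vocabulary size" concrete, here is a small sketch that counts unique tokens before and after mapping word forms to a shared root. The token list and the `toy_root` mapping are hypothetical, hand-written stand-ins for what a real stemmer or lemmatizer would produce.

```python
# Hypothetical sample tokens; 'campus' appears twice on purpose.
tokens = ["change", "changes", "changed", "changing", "campus", "campus"]

# Vocabulary = the set of distinct tokens.
vocab = set(tokens)

# Hand-written stand-in for a stemmer/lemmatizer (an assumption for illustration).
toy_root = {"changes": "change", "changed": "change", "changing": "change"}
reduced_vocab = {toy_root.get(tok, tok) for tok in tokens}

print(len(vocab), sorted(vocab))
print(len(reduced_vocab), sorted(reduced_vocab))
```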

In [14]:
# Download necessary NLTK data
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [15]:
# Import WordNetLemmatizer and PorterStemmer for text normalization
from nltk.stem import WordNetLemmatizer, PorterStemmer

In [16]:
# Initialize stemmer and lemmatizer
stem = PorterStemmer()
lam = WordNetLemmatizer()
# Both reduce words to a root form. A lemma is always a meaningful dictionary word, while a stem may not be one.

# Lemmatization

**1-** Lemmatization reduces words to their base or dictionary form, which is called a "lemma". The lemma is always a valid word.

**2-** Lemmatization depends on the part of speech of a word to determine its lemma. For example, as verbs the words "running", "runs", and "ran" are lemmatized to "run", while the adjective "better" is lemmatized to "good". NLTK's `WordNetLemmatizer` does not infer the part of speech itself: `lemmatize()` treats every word as a noun unless a `pos` argument such as `'v'` or `'a'` is passed.



In [17]:
# Demonstrate lemmatization
print(lam.lemmatize('change')) # Lemmatize the word 'change'
print(lam.lemmatize('changes')) # Lemmatize the word 'changes'
print(lam.lemmatize('changer')) # Lemmatize the word 'changer'
print(lam.lemmatize('changed')) # Lemmatize the word 'changed'
print(lam.lemmatize('changess')) # Lemmatize the word 'changess'
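# Every result is still a valid word. 'changed' is left as-is because lemmatize() treats words as nouns by default; lam.lemmatize('changed', pos='v') returns 'change'.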

change
change
changer
changed
changess


# Stemming

**1-** Stemming is a process that reduces words to their root or base form, which is called a "stem". The stem might not be an actual word in the dictionary.

**2-** Stemming is a cruder, rule-based process that strips suffixes, so it often produces stems that are not linguistically correct but are sufficient for many text processing tasks. For example, "running" and "runs" are both reduced to the stem "run", but an irregular form such as "ran" is left unchanged by the Porter stemmer.
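
A quick way to check these examples is to run them through `PorterStemmer`, which needs no downloaded data. This is only a sketch. The word "studies" is an extra illustrative input that does not appear in the text, included to show a stem that is not a dictionary word.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word)
         for word in ['running', 'runs', 'ran', 'studies']}

for word, root in stems.items():
    print(word, '->', root)
```

Suffix stripping handles "running" and "runs", but Porter's rules have no way to relate the irregular past tense "ran" to "run", so it passes through unchanged.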

In [18]:
# Demonstrate stemming
print(stem.stem('change')) # Stem the word 'change'
print(stem.stem('changes')) # Stem the word 'changes'
print(stem.stem('changer')) # Stem the word 'changer'
print(stem.stem('changed')) # Stem the word 'changed'
print(stem.stem('changing')) # Stem the word 'changing'
# Stemming is more aggressive: most forms collapse to 'chang', which is not a dictionary word.

chang
chang
changer
chang
chang


# Comparison
In essence, stemming is faster and simpler, while lemmatization is more accurate and produces linguistically correct results, but it is also more computationally expensive.