Tokenization


###### Tokenization in Natural Language Processing (NLP) is the process of breaking text into smaller units called tokens. Tokens can be words, subwords, phrases, or characters, depending on the tokenizer. It is usually the first step in text preprocessing, because later steps such as stemming, tagging, and vectorization operate on tokens rather than on raw strings. Tasks such as text classification and information retrieval depend on it to turn free text into a sequence that algorithms can process.

In [4]:
import nltk
# NLTK (Natural Language Toolkit) is a Python library for working with human language data.
# It provides tokenization, stemming, tagging, parsing, semantic reasoning, and more,
# and is widely used in NLP research and applications.
# Note: nltk.word_tokenize relies on tokenizer data that is installed separately. If it is missing,
# NLTK raises a LookupError naming the resource to fetch with nltk.download(...).


In [3]:
# Function to word-tokenize a sentence
def word_tokenize_sentence(sentence):
    """
    Tokenizes a sentence into individual words using the nltk library.

    Parameters:
    sentence (str): The input sentence to be tokenized.

    Returns:
    list: A list of tokens representing the individual words in the sentence.
    """
    tokens = nltk.word_tokenize(sentence)
    return tokens

In [5]:
sample_sentences = [
    "I love programming.",  # First sample sentence
    "The sun is shining today.",  # Second sample sentence
    "She sings beautifully.",  # Third sample sentence
    "The cat is sleeping.",  # Fourth sample sentence
    "He enjoys playing video games."  # Fifth sample sentence
]
for sentence in sample_sentences:
    tokens = word_tokenize_sentence(sentence)
    print(tokens)


['I', 'love', 'programming', '.']
['The', 'sun', 'is', 'shining', 'today', '.']
['She', 'sings', 'beautifully', '.']
['The', 'cat', 'is', 'sleeping', '.']
['He', 'enjoys', 'playing', 'video', 'games', '.']
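
Word-level splitting is only one granularity. As a rough illustration of the idea, and not of how `nltk.word_tokenize` works internally, the sketch below uses only the standard `re` module to split a sentence into word and punctuation tokens, and then into characters. The function name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The cat is sleeping."
word_tokens = simple_tokenize(sentence)
char_tokens = list(sentence.replace(" ", ""))  # character-level tokens, spaces dropped

print(word_tokens)
print(char_tokens[:6])
```

A regex this simple splits a contraction such as "don't" into three tokens (`don`, `'`, `t`), which is one reason to prefer a purpose-built tokenizer for real text.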


Stemming

###### Stemming in Natural Language Processing (NLP) is the process of reducing words to a base form, known as the stem, by stripping suffixes with heuristic rules. The stem is not guaranteed to be a dictionary word (for example, "happily" becomes "happili" under the Porter stemmer). Stemming is commonly used in text preprocessing to normalize words and reduce the dimensionality of text data. By mapping different inflected forms of a word to one stem, it helps tasks such as information retrieval, text classification, and sentiment analysis treat those forms as a single entity.
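
To make the "single entity" idea concrete, here is a minimal sketch that groups word forms by stem. It assumes a toy suffix rule (`toy_stem`, a hypothetical helper written for this example) rather than any of the NLTK stemmers used in the cells that follow.

```python
import re
from collections import defaultdict

def toy_stem(word):
    """Strip one of a few common English suffixes (a toy rule, not the Porter algorithm)."""
    return re.sub(r"(ing|ed|s)$", "", word)

words = ["play", "played", "playing", "plays", "jump", "jumps", "jumped"]

# Map each stem to the list of surface forms that reduce to it.
groups = defaultdict(list)
for w in words:
    groups[toy_stem(w)].append(w)

print(len(set(words)), "distinct words ->", len(groups), "distinct stems")
for stem, members in groups.items():
    print(stem, members)
```

Seven surface forms collapse into two stems here, which is the dimensionality reduction described above in miniature.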



In [24]:
# Stemming example
# [connected, connecting, connection] ---> connect

example_words = ['running', 'jumps', 'swimming', 'played', 'eating', 'happily', 'quickly', 'beautifully', 'friendly', 'lovely']


In [11]:
# PorterStemmer
# The Porter stemmer reduces words to a stem by applying a fixed sequence of suffix-stripping rules.

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()
for word in example_words:
    print(f"{word} -----> {porter_stemmer.stem(word)}")

running -----> run
jumps -----> jump
swimming -----> swim
played -----> play
eating -----> eat
happily -----> happili
quickly -----> quickli
beautifully -----> beauti
friendly -----> friendli
lovely -----> love


In [14]:
# A drawback of the Porter stemmer: the stem it returns is not always a valid word, as in this example
porter_stemmer.stem("congratulations")

'congratul'

In [19]:
# RegexpStemmer (in nltk.stem) removes every substring that matches a regular expression.
# Because it applies the pattern blindly, it can leave doubled consonants, e.g. "running" -> "runn".
from nltk.stem import RegexpStemmer

# Pattern for stemming: strip a trailing "ing", "s", "ed", or "ly"
pattern = 'ing$|s$|ed$|ly$'

# Initialize the RegexpStemmer with the pattern
regexp_stemmer = RegexpStemmer(pattern)

for word in example_words:
    print(f"{word} -----> {regexp_stemmer.stem(word)}")



running -----> runn
jumps -----> jump
swimming -----> swimm
played -----> play
eating -----> eat
happily -----> happi
quickly -----> quick
beautifully -----> beautiful
friendly -----> friend
lovely -----> love
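
The same suffix stripping can be approximated with the standard `re` module. The sketch below is an assumption about the general technique, not NLTK's source code. The `min_length` guard is a hypothetical parameter added here to show how very short words can be protected. NLTK's `RegexpStemmer` accepts a `min` argument that appears to serve a similar purpose; check its documentation for the exact behavior.

```python
import re

# Same pattern as in the cell above, compiled once.
SUFFIX_PATTERN = re.compile(r"ing$|s$|ed$|ly$")

def strip_suffix(word, min_length=0):
    """Remove a matching suffix unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return SUFFIX_PATTERN.sub("", word)

# Left column: no guard. Right column: words under 5 characters are left alone.
for word in ["running", "quickly", "is", "sing"]:
    print(word, "->", strip_suffix(word), "|", strip_suffix(word, min_length=5))
```

Without the guard, "is" and "sing" are cut down to a single letter, which shows how aggressive a purely pattern-based stemmer can be.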


In [22]:
# The Snowball stemmer is a family of stemming algorithms covering multiple languages.
# Its English stemmer (also known as Porter2) is a revision of the original Porter stemmer
# and tends to produce cleaner stems, e.g. "quickly" -> "quick" rather than "quickli".

from nltk.stem import SnowballStemmer

# Initialize the SnowballStemmer for English
snowball_stemmer = SnowballStemmer("english")

for word in example_words:
    print(f"{word} -----> {snowball_stemmer.stem(word)}")

running -----> run
jumps -----> jump
swimming -----> swim
played -----> play
eating -----> eat
happily -----> happili
quickly -----> quick
beautifully -----> beauti
friendly -----> friend
lovely -----> love


Lemmatization

###### Lemmatization is the process of reducing words to their dictionary form, known as the lemma, using a vocabulary and morphological analysis. Unlike stemming, which strips affixes by rule, lemmatization returns a valid word in the language. The result depends on the word's part of speech: NLTK's WordNetLemmatizer does not infer it from context, so it must be passed through the pos argument (the default is noun).

In [36]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize each word, treating it as a verb
print("VERBS SET")
for word in example_words:
    print(f"{word} -----> {lemmatizer.lemmatize(word, pos=wordnet.VERB)}")


VERBS SET
running -----> run
jumps -----> jump
swimming -----> swim
played -----> play
eating -----> eat
happily -----> happily
quickly -----> quickly
beautifully -----> beautifully
friendly -----> friendly
lovely -----> lovely


In [39]:
print("NOUNS SET")

for word in example_words:
    print(f"{word} -----> {lemmatizer.lemmatize(word, pos=wordnet.NOUN)}")

NOUNS SET
running -----> running
jumps -----> jump
swimming -----> swimming
played -----> played
eating -----> eating
happily -----> happily
quickly -----> quickly
beautifully -----> beautifully
friendly -----> friendly
lovely -----> lovely
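
The two runs above differ only in the part of speech passed to the lemmatizer. A minimal sketch of why that matters, assuming a tiny hand-written lexicon (`TOY_LEXICON` and `toy_lemmatize` are hypothetical names for this example; WordNet itself is far larger and also applies morphological rules):

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
TOY_LEXICON = {
    ("running", "v"): "run",
    ("swimming", "v"): "swim",
    ("played", "v"): "play",
    ("jumps", "v"): "jump",
    ("jumps", "n"): "jump",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEXICON.get((word, pos), word)

for word in ["running", "jumps", "happily"]:
    print(word, "| verb:", toy_lemmatize(word, "v"), "| noun:", toy_lemmatize(word, "n"))
```

Because the lookup key includes the part of speech, "running" maps to "run" as a verb but stays "running" as a noun, and a word with no entry is returned unchanged.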


Stopwords

###### Stopwords in Natural Language Processing (NLP) are very common words that are often filtered out before processing text data. Words such as "and", "the", "is", "in", and "at" usually carry little meaning on their own and are removed so that analysis can focus on the more informative words. Removing stopwords reduces the dimensionality of the data and can improve tasks like text classification and information retrieval, although it is not always safe: NLTK's English list includes "not", which matters for sentiment analysis.


In [43]:
paragraph = """
The 1947 partition of Punjab was a significant and traumatic event in the history of the Indian subcontinent. 
It marked the division of British India into two independent dominions, India and Pakistan. 
Punjab, a region with a rich cultural and historical heritage, was split into East Punjab, which became part of India, and West Punjab, which became part of Pakistan.
This partition led to one of the largest mass migrations in human history, with millions of people crossing borders to join their chosen nation. 
The partition was accompanied by widespread violence, communal riots, and a humanitarian crisis, resulting in the loss of countless lives and the displacement of millions. 
The legacy of the partition continues to influence the socio-political landscape of both India and Pakistan to this day.
"""
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# Tokenize the paragraph
words = word_tokenize(paragraph)

# Get the list of stopwords for English
stop_words = set(stopwords.words('english'))

# Filter out the stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]

# Print the filtered words
print(filtered_words)

['1947', 'partition', 'Punjab', 'significant', 'traumatic', 'event', 'history', 'Indian', 'subcontinent', '.', 'marked', 'division', 'British', 'India', 'two', 'independent', 'dominions', ',', 'India', 'Pakistan', '.', 'Punjab', ',', 'region', 'rich', 'cultural', 'historical', 'heritage', ',', 'split', 'East', 'Punjab', ',', 'became', 'part', 'India', ',', 'West', 'Punjab', ',', 'became', 'part', 'Pakistan', '.', 'partition', 'led', 'one', 'largest', 'mass', 'migrations', 'human', 'history', ',', 'millions', 'people', 'crossing', 'borders', 'join', 'chosen', 'nation', '.', 'partition', 'accompanied', 'widespread', 'violence', ',', 'communal', 'riots', ',', 'humanitarian', 'crisis', ',', 'resulting', 'loss', 'countless', 'lives', 'displacement', 'millions', '.', 'legacy', 'partition', 'continues', 'influence', 'socio-political', 'landscape', 'India', 'Pakistan', 'day', '.']
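
The filtered list above still contains punctuation tokens such as ',' and '.', since those are not stopwords. One possible refinement, sketched here with a small hand-picked stopword set (`TOY_STOPWORDS` is an assumption for illustration, not NLTK's list), is to also require that a token contain at least one letter or digit:

```python
# Small hand-picked stopword set for illustration only; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "of", "was", "a", "and", "in", "to", "this"}

def keep_token(token):
    """Keep tokens that contain a letter or digit and are not stopwords."""
    has_alnum = any(ch.isalnum() for ch in token)
    return has_alnum and token.lower() not in TOY_STOPWORDS

tokens = ["The", "legacy", "of", "the", "partition", "continues", "to", "influence",
          "the", "socio-political", "landscape", ".", ","]
content_tokens = [t for t in tokens if keep_token(t)]
print(content_tokens)
```

Checking for any alphanumeric character, rather than calling `str.isalnum()` on the whole token, keeps hyphenated words such as "socio-political".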


[('1947', 'CD'), ('partition', 'NN'), ('Punjab', 'NNP'), ('significant', 'JJ'), ('traumatic', 'JJ'), ('event', 'NN'), ('history', 'NN'), ('Indian', 'JJ'), ('subcontinent', 'NN'), ('.', '.'), ('marked', 'VBN'), ('division', 'NN'), ('British', 'JJ'), ('India', 'NNP'), ('two', 'CD'), ('independent', 'JJ'), ('dominions', 'NNS'), (',', ','), ('India', 'NNP'), ('Pakistan', 'NNP'), ('.', '.'), ('Punjab', 'NNP'), (',', ','), ('region', 'NN'), ('rich', 'JJ'), ('cultural', 'JJ'), ('historical', 'JJ'), ('heritage', 'NN'), (',', ','), ('split', 'VBD'), ('East', 'NNP'), ('Punjab', 'NNP'), (',', ','), ('became', 'VBD'), ('part', 'NN'), ('India', 'NNP'), (',', ','), ('West', 'NNP'), ('Punjab', 'NNP'), (',', ','), ('became', 'VBD'), ('part', 'NN'), ('Pakistan', 'NNP'), ('.', '.'), ('partition', 'NN'), ('led', 'VBD'), ('one', 'CD'), ('largest', 'JJS'), ('mass', 'NN'), ('migrations', 'NNS'), ('human', 'JJ'), ('history', 'NN'), (',', ','), ('millions', 'NNS'), ('people', 'NNS'), ('crossing', 'VBG'), ('bo

In [49]:
# Named Entity Recognition (NER) with NLTK follows these steps:
#   1. Tokenize the text into sentences.
#   2. Tokenize each sentence into words.
#   3. Tag each word with its part of speech.
#   4. Run a named entity chunker over the tagged words.
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize, sent_tokenize

# Resources needed by ne_chunk. If NLTK raises a LookupError, download the resource it names.
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Tokenize the text into sentences
sentences = sent_tokenize(paragraph)

# Tokenize each sentence into words and perform POS tagging

tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
pos_tagged_sentences = [pos_tag(sentence) for sentence in tokenized_sentences]

# Perform Named Entity Recognition
named_entities = [ne_chunk(sentence) for sentence in pos_tagged_sentences]

# Print the named entities
for tree in named_entities:
    print(tree)

(S
  The/DT
  1947/CD
  partition/NN
  of/IN
  (PERSON Punjab/NNP)
  was/VBD
  a/DT
  significant/JJ
  and/CC
  traumatic/JJ
  event/NN
  in/IN
  the/DT
  history/NN
  of/IN
  the/DT
  (GPE Indian/JJ)
  subcontinent/NN
  ./.)
(S
  It/PRP
  marked/VBD
  the/DT
  division/NN
  of/IN
  (GPE British/JJ)
  India/NNP
  into/IN
  two/CD
  independent/JJ
  dominions/NNS
  ,/,
  (GPE India/NNP)
  and/CC
  (GPE Pakistan/NNP)
  ./.)
(S
  (GPE Punjab/NNP)
  ,/,
  a/DT
  region/NN
  with/IN
  a/DT
  rich/JJ
  cultural/JJ
  and/CC
  historical/JJ
  heritage/NN
  ,/,
  was/VBD
  split/VBN
  into/IN
  (GPE East/NNP Punjab/NNP)
  ,/,
  which/WDT
  became/VBD
  part/NN
  of/IN
  (GPE India/NNP)
  ,/,
  and/CC
  (LOCATION West/NNP Punjab/NNP)
  ,/,
  which/WDT
  became/VBD
  part/NN
  of/IN
  (GPE Pakistan/NNP)
  ./.)
(S
  This/DT
  partition/NN
  led/VBD
  to/TO
  one/CD
  of/IN
  the/DT
  largest/JJS
  mass/NN
  migrations/NNS
  in/IN
  human/JJ
  history/NN
  ,/,
  with/IN
  millions/NNS
  of/IN
  peo

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
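
The printed trees mix entity chunks with ordinary tagged words. As a rough sketch, the labelled chunks can be collected from that bracket notation with a regular expression. The excerpt below is abridged from the third sentence of the output above, and `extract_entities` is a hypothetical helper written for this example.

```python
import re

# Abridged excerpt in the same bracket notation as the printed output.
tree_text = """(S
  (GPE Punjab/NNP)
  ,/,
  was/VBD
  split/VBN
  into/IN
  (GPE East/NNP Punjab/NNP)
  ,/,
  and/CC
  (LOCATION West/NNP Punjab/NNP)
  ./.)"""

def extract_entities(text):
    """Return (label, phrase) pairs for every innermost labelled chunk."""
    entities = []
    # Matches "(LABEL word/TAG word/TAG ...)" with no nested parentheses inside.
    for label, body in re.findall(r"\(([A-Z]+) ([^()]+)\)", text):
        # Each item looks like "word/TAG"; keep the part before the last slash.
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((label, " ".join(words)))
    return entities

print(extract_entities(tree_text))
```

When the `Tree` objects returned by `ne_chunk` are still in memory, walking them directly (for example with their `subtrees()` method) is likely more robust than parsing printed text. Note also that the labels are only as good as the chunker: in the first sentence of the output, "Punjab" is tagged as PERSON.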
