
# Tokenization

Tokenization refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.


*   Corpus: Body of text
*   Lexicon: Words and their meanings
*   Token: Each “entity” produced when the text is split up according to a set of rules
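
To make the idea of token granularity concrete, here is a minimal sketch using only the standard library. The sample sentence and the regular expression are assumptions chosen for illustration; the regex is a simplification and not the rule set that NLTK's tokenizers apply.

```python
import re

sample = "Tokenization isn't hard."

# Character-level tokens: every character (spaces included) is a token.
char_tokens = list(sample)

# Word-level tokens: runs of word characters, or single punctuation marks.
# Simplified rule for illustration only, not NLTK's tokenizer.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(char_tokens[:5])
print(word_tokens)
```

Note how this naive rule splits "isn't" into three pieces, whereas `word_tokenize` in the cells below handles contractions with its own rules.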



In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# dummy text
text ="""
    Imagine you're trying to teach a child to read. Instead of diving straight into complex paragraphs,
    you'd start by introducing them to individual letters, then syllables, and finally, whole words.
    In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable
    units for machines. The primary goal of tokenization is to represent text in a manner that's meaningful for
    machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns.
    This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.
    For instance, when a machine encounters the word "running", it doesn't see it as a singular entity but rather
    as a combination of tokens that it can analyze and derive meaning from.
"""

In [None]:
# Tokenizing sentences
sentences = sent_tokenize(text)
print(sentences)

["\n    Imagine you're trying to teach a child to read.", "Instead of diving straight into complex paragraphs,\n    you'd start by introducing them to individual letters, then syllables, and finally, whole words.", 'In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable\n    units for machines.', "The primary goal of tokenization is to represent text in a manner that's meaningful for\n    machines without losing its context.", 'By converting text into tokens, algorithms can more easily identify patterns.', 'This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.', 'For instance, when a machine encounters the word "running", it doesn\'t see it as a singular entity but rather\n    as a combination of tokens that it can analyze and derive meaning from.']


In [None]:
# Tokenizing words
words = word_tokenize(text)
print(words)

['Imagine', 'you', "'re", 'trying', 'to', 'teach', 'a', 'child', 'to', 'read', '.', 'Instead', 'of', 'diving', 'straight', 'into', 'complex', 'paragraphs', ',', 'you', "'d", 'start', 'by', 'introducing', 'them', 'to', 'individual', 'letters', ',', 'then', 'syllables', ',', 'and', 'finally', ',', 'whole', 'words', '.', 'In', 'a', 'similar', 'vein', ',', 'tokenization', 'breaks', 'down', 'vast', 'stretches', 'of', 'text', 'into', 'more', 'digestible', 'and', 'understandable', 'units', 'for', 'machines', '.', 'The', 'primary', 'goal', 'of', 'tokenization', 'is', 'to', 'represent', 'text', 'in', 'a', 'manner', 'that', "'s", 'meaningful', 'for', 'machines', 'without', 'losing', 'its', 'context', '.', 'By', 'converting', 'text', 'into', 'tokens', ',', 'algorithms', 'can', 'more', 'easily', 'identify', 'patterns', '.', 'This', 'pattern', 'recognition', 'is', 'crucial', 'because', 'it', 'makes', 'it', 'possible', 'for', 'machines', 'to', 'understand', 'and', 'respond', 'to', 'human', 'input', 

# Stemming

Stemming, in Natural Language Processing (NLP), is the process of reducing a word to its stem: the base form to which prefixes and suffixes attach.
For example, the stem of the words “eating,” “eats,” “eaten” is “eat.”
There are several stemming algorithms. Here we use the Porter stemmer.
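
As a rough illustration of what suffix stripping means, the sketch below uses a hypothetical `toy_stem` helper with a hand-picked suffix list. It is far simpler than the Porter algorithm used in this notebook and is only meant to show the idea.

```python
def toy_stem(word, suffixes=("ing", "en", "ed", "s")):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["eating", "eats", "eaten"]])
```

A real stemmer applies many ordered rules with conditions on the remaining stem, so its results can differ from this toy version.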

In [None]:
from nltk.stem import PorterStemmer
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# create a PorterStemmer instance
stemmer = PorterStemmer()

Stopwords (very common words such as "the", "is" and "to") do not add much to the overall meaning of a text, so we can remove them.
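
One detail worth keeping in mind is letter case. NLTK's English stopword list is lowercase, so a capitalized word at the start of a sentence is not matched unless it is lowercased first. The sketch below shows the difference with a tiny hand-picked stopword set (an assumption for illustration, not the NLTK list).

```python
# A tiny hand-picked stopword set, for illustration only.
stop_words = {"the", "of", "is", "to", "in", "a"}
tokens = ["The", "goal", "of", "tokenization", "is", "to", "represent", "text"]

# Case-sensitive filter: "The" slips through because only "the" is in the set.
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive filter: compare the lowercased token instead.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```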

In [None]:
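# Stemming
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # drop stopwords, then stem what is left
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)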

print(sentences)

["imagin 're tri teach child read .", "instead dive straight complex paragraph , 'd start introduc individu letter , syllabl , final , whole word .", 'similar vein , token break vast stretch text digest understand unit machin .', "primari goal token repr text manner 's mean machin without lose context .", 'convert text token , algorithm easili identifi pattern .', 'thi pattern recognit crucial make possibl machin understand respond human input .', "instanc , machin encount word `` run `` , n't see singular entiti rather combin token analyz deriv mean ."]


The problem with stemming is that it can produce truncated forms that are not real words, such as "imagin" and "tri" in the output above.

# Lemmatization

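Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.
It maps a word to its lemma, the dictionary form. For example, the verb "walk" might appear as "walking," "walks" or "walked"; lemmatization removes inflectional endings such as "s," "ed" and "ing" and groups all of these forms under the lemma "walk." Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed to `lemmatize`, which is why verb forms such as "trying" and "diving" come out unchanged in the output below.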
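
Unlike stemming, lemmatization relies on a vocabulary lookup, so the result is always a real word and irregular forms can be handled. The sketch below uses a hypothetical four-entry dictionary, `LEMMAS`, in place of WordNet to show the principle.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMAS = {"walking": "walk", "walks": "walk", "walked": "walk", "mice": "mouse"}

def toy_lemmatize(word):
    """Return the lemma if the word is in the lexicon, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)

print([toy_lemmatize(w) for w in ["walking", "walks", "walked", "mice", "table"]])
```

Words missing from the lexicon pass through untouched, which mirrors how a lemmatizer leaves a word alone when it cannot find a matching entry.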

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# dummy text
text ="""
    Imagine you're trying to teach a child to read. Instead of diving straight into complex paragraphs,
    you'd start by introducing them to individual letters, then syllables, and finally, whole words.
    In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable
    units for machines. The primary goal of tokenization is to represent text in a manner that's meaningful for
    machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns.
    This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.
    For instance, when a machine encounters the word "running", it doesn't see it as a singular entity but rather
    as a combination of tokens that it can analyze and derive meaning from.
"""
sentences = sent_tokenize(text)

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
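# Lemmatization
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # drop stopwords, then lemmatize what is left
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)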

print(sentences)

["Imagine 're trying teach child read .", "Instead diving straight complex paragraph , 'd start introducing individual letter , syllable , finally , whole word .", 'In similar vein , tokenization break vast stretch text digestible understandable unit machine .', "The primary goal tokenization represent text manner 's meaningful machine without losing context .", 'By converting text token , algorithm easily identify pattern .', 'This pattern recognition crucial make possible machine understand respond human input .', "For instance , machine encounter word `` running '' , n't see singular entity rather combination token analyze derive meaning ."]


# Bag of Words

Bag of Words is a popular and simple method of feature extraction for text data.
The model represents each document as a vector of word counts over a shared vocabulary, ignoring grammar and word order.
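
Before turning to scikit-learn, the counting step can be sketched by hand with `collections.Counter`. The three short documents are made up for illustration, and the variable names are assumptions rather than anything from the cells below.

```python
from collections import Counter

# Three tiny, already-cleaned documents (illustrative only).
docs = ["nepal place festival", "festival festival day", "kathmandu capital city"]

# The vocabulary is the sorted set of all words across the documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a vector of counts, one entry per vocabulary word.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each column corresponds to one vocabulary word, so the second document has a 2 in the "festival" column.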


In [42]:
text ="""
Nepal officially the Federal Democratic Republic of Nepal is a landlocked central Himalayan country in South Asia. Nepal is divided into 7 states and 77 districts and 753 local units including 6 metropolises, 11 sub-metropolises, 246 municipal councils and 481 villages. It has a population of 26.4 million and is the 93rd largest country by area. Bordering China in the north and India in the south, east, and west, it is the largest sovereign Himalayan state.  Nepal has a diverse geography, including fertile plains, subalpine forested hills, and eight of the world’s ten tallest mountains, including Mount Everest, the highest point on Earth. Kathmandu is the nation’s capital and largest city.
Nepal is a place of festivals . Festivals may be linked with the remembrance of the departed soul, to herald the different seasons, to mark the beginning or end of the agricultural cycle, to mark the national events, or just family celebrations.  On a festive day the Nepalese take their ritual bath, worship different gods and goddesses, visit temple, observe fasting and undertake feasting.
"""

In [56]:
# Set-up for cleaning the text
# Each sentence is lowercased, stripped of non-letters and split into words,
# then stemmed and/or lemmatized
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
wordnet = WordNetLemmatizer()
sentences = nltk.sent_tokenize(text)
corpus=[]

In [55]:
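# stemming
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
  # keep letters only; digits and punctuation become spaces (so "93rd" turns into "rd")
  review = re.sub('[^a-zA-Z]',' ', sentences[i])
  review = review.lower()
  review = review.split()
  review = [stemmer.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)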

print(corpus)

['nepal offici feder democrat republ nepal landlock central himalayan countri south asia', 'nepal divid state district local unit includ metropolis sub metropolis municip council villag', 'popul million rd largest countri area', 'border china north india south east west largest sovereign himalayan state', 'nepal divers geographi includ fertil plain subalpin forest hill eight world ten tallest mountain includ mount everest highest point earth', 'kathmandu nation capit largest citi', 'nepal place festiv', 'festiv may link remembr depart soul herald differ season mark begin end agricultur cycl mark nation event famili celebr', 'festiv day nepales take ritual bath worship differ god goddess visit templ observ fast undertak feast']


In [57]:
# lemmatization
corpus = []  # start fresh so the stemmed sentences from the previous cell are not mixed in
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  review = re.sub('[^a-zA-Z]',' ', sentences[i])
  review = review.lower()
  review = review.split()
  review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)

print(corpus)

['nepal officially federal democratic republic nepal landlocked central himalayan country south asia', 'nepal divided state district local unit including metropolis sub metropolis municipal council village', 'population million rd largest country area', 'bordering china north india south east west largest sovereign himalayan state', 'nepal diverse geography including fertile plain subalpine forested hill eight world ten tallest mountain including mount everest highest point earth', 'kathmandu nation capital largest city', 'nepal place festival', 'festival may linked remembrance departed soul herald different season mark beginning end agricultural cycle mark national event family celebration', 'festive day nepalese take ritual bath worship different god goddess visit temple observe fasting undertake feasting']


In [63]:
!pip uninstall sklearn -y

In [64]:
# creating the bag of words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X)

[[0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0
  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 2 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0 1 1
  0 0 0 1 1 0 2 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0
  0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 

Disadvantage: Bag of Words treats every word as equally important, weighting it only by its raw count, so it is not well suited to tasks such as sentiment analysis.

# TF-IDF

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
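
The multiplication described above can be sketched by hand. This version uses the plain textbook definitions (term frequency as a share of the document, and the natural log of the number of documents divided by the document frequency); the documents and the `tf_idf` helper are assumptions for illustration. scikit-learn's `TfidfVectorizer` by default also smooths the idf term and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny tokenized documents (illustrative only).
docs = [
    ["nepal", "place", "festival"],
    ["festival", "season", "festival"],
    ["kathmandu", "capital"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    """Plain tf * idf; not the smoothed variant scikit-learn uses by default."""
    tf = doc.count(word) / len(doc)            # share of the document taken by the word
    idf = math.log(n_docs / doc_freq[word])    # rarer words get a larger idf
    return tf * idf

print(tf_idf("festival", docs[1]))  # frequent here, but also present in another document
print(tf_idf("season", docs[1]))    # appears once, but only in this document
```

Even though "festival" occurs twice in the second document, "season" scores higher because it appears in no other document.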

In [65]:
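# clean the text
# lemmatization
# Note: corpus already holds sentences from the Bag of Words section, so appending
# here stores every sentence again (see the output below); run corpus = [] first
# if the duplicates are not wanted
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
  review = re.sub('[^a-zA-Z]',' ', sentences[i])
  review = review.lower()
  review = review.split()
  review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  corpus.append(review)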

print(corpus)

['nepal officially federal democratic republic nepal landlocked central himalayan country south asia', 'nepal divided state district local unit including metropolis sub metropolis municipal council village', 'population million rd largest country area', 'bordering china north india south east west largest sovereign himalayan state', 'nepal diverse geography including fertile plain subalpine forested hill eight world ten tallest mountain including mount everest highest point earth', 'kathmandu nation capital largest city', 'nepal place festival', 'festival may linked remembrance departed soul herald different season mark beginning end agricultural cycle mark national event family celebration', 'festive day nepalese take ritual bath worship different god goddess visit temple observe fasting undertake feasting', 'nepal officially federal democratic republic nepal landlocked central himalayan country south asia', 'nepal divided state district local unit including metropolis sub metropolis 

In [66]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X)

[[0.         0.         0.30820435 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.43995275 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.22169494 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.25259275]]
