
# Tokenization

Tokenization refers to the process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as small as characters or as long as words.


*   Corpus: A body of text
*   Lexicon: Words and their meanings
*   Token: Each “entity” produced when the text is split according to a set of rules
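
To make the granularity point concrete, here is a minimal sketch that splits one string into character-level and word-level tokens using only the standard library. The `sample` string and the regular expression are assumptions chosen for illustration; this is not how NLTK's tokenizers work internally.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "Tokens can be small."

# Character-level tokens: every character (spaces included) is a token.
char_tokens = list(sample)

# Word-level tokens: runs of word characters, or single punctuation marks.
# This regex is a simplification, not NLTK's tokenization rules.
word_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(len(char_tokens))
print(word_tokens)
```

The same text yields many more character tokens than word tokens, which is the trade-off behind choosing a token granularity.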



In [None]:
!pip install nltk

ERROR: Operation cancelled by user

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Sample text used in the tokenization examples
text = """
    Imagine you're trying to teach a child to read. Instead of diving straight into complex paragraphs,
    you'd start by introducing them to individual letters, then syllables, and finally, whole words.
    In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable
    units for machines. The primary goal of tokenization is to represent text in a manner that's meaningful for
    machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns.
    This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.
    For instance, when a machine encounters the word "running", it doesn't see it as a singular entity but rather
    as a combination of tokens that it can analyze and derive meaning from.
"""

In [None]:
# Tokenizing sentences
sentences = sent_tokenize(text)
print(sentences)

["\n    Imagine you're trying to teach a child to read.", "Instead of diving straight into complex paragraphs,\n    you'd start by introducing them to individual letters, then syllables, and finally, whole words.", 'In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable\n    units for machines.', "The primary goal of tokenization is to represent text in a manner that's meaningful for\n    machines without losing its context.", 'By converting text into tokens, algorithms can more easily identify patterns.', 'This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.', 'For instance, when a machine encounters the word "running", it doesn\'t see it as a singular entity but rather\n    as a combination of tokens that it can analyze and derive meaning from.']


In [None]:
# Tokenizing words
words = word_tokenize(text)
print(words)

['Imagine', 'you', "'re", 'trying', 'to', 'teach', 'a', 'child', 'to', 'read', '.', 'Instead', 'of', 'diving', 'straight', 'into', 'complex', 'paragraphs', ',', 'you', "'d", 'start', 'by', 'introducing', 'them', 'to', 'individual', 'letters', ',', 'then', 'syllables', ',', 'and', 'finally', ',', 'whole', 'words', '.', 'In', 'a', 'similar', 'vein', ',', 'tokenization', 'breaks', 'down', 'vast', 'stretches', 'of', 'text', 'into', 'more', 'digestible', 'and', 'understandable', 'units', 'for', 'machines', '.', 'The', 'primary', 'goal', 'of', 'tokenization', 'is', 'to', 'represent', 'text', 'in', 'a', 'manner', 'that', "'s", 'meaningful', 'for', 'machines', 'without', 'losing', 'its', 'context', '.', 'By', 'converting', 'text', 'into', 'tokens', ',', 'algorithms', 'can', 'more', 'easily', 'identify', 'patterns', '.', 'This', 'pattern', 'recognition', 'is', 'crucial', 'because', 'it', 'makes', 'it', 'possible', 'for', 'machines', 'to', 'understand', 'and', 'respond', 'to', 'human', 'input', 

# Stemming

Stemming, in Natural Language Processing (NLP), is the process of reducing a word to its stem: the base form to which affixes such as suffixes and prefixes attach.
For example, the stem of the words “eating,” “eats,” and “eaten” is “eat.”
There are several stemming algorithms. Here we use the Porter stemmer.
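
As a rough illustration of the idea, the sketch below strips a few suffixes from a word. It is a toy under stated assumptions: the function name `naive_stem`, the suffix list, and the minimum stem length are all hypothetical, and this is not the Porter algorithm.

```python
def naive_stem(word, suffixes=("ing", "en", "s"), min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in suffixes:
        # Only strip when at least min_stem_len characters would be left.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# All four forms collapse to the same stem under these toy rules.
stems = [naive_stem(w) for w in ["eating", "eats", "eaten", "eat"]]
print(stems)
```

The Porter stemmer is considerably more careful: it applies its rules in ordered phases and checks conditions on the remaining stem before removing a suffix.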

In [None]:
from nltk.stem import PorterStemmer
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Create a PorterStemmer instance
stemmer = PorterStemmer()

Stopwords (very common words such as "the" or "is") add little to the overall meaning of a text, so we can remove them.
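
A minimal sketch of stopword filtering, assuming a small hand-picked set: the `STOPWORDS` set and the `remove_stopwords` helper are hypothetical and stand in for NLTK's much longer English list. Lowercasing each token before the lookup is a design choice made here; a case-sensitive comparison against a lowercase list would keep sentence-initial words such as "The".

```python
# Hypothetical, hand-picked stopword set for illustration only;
# NLTK's English list is much longer.
STOPWORDS = {"a", "the", "to", "of", "is", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "goal", "of", "tokenization", "is", "to", "represent", "text"]
filtered = remove_stopwords(tokens)
print(filtered)
```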

In [None]:
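# Stemming: tokenize each sentence, drop stopwords, and stem what remains
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  words = [stemmer.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)

print(sentences)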

["imagin 're tri teach child read .", "instead dive straight complex paragraph , 'd start introduc individu letter , syllabl , final , whole word .", 'similar vein , token break vast stretch text digest understand unit machin .', "primari goal token repr text manner 's mean machin without lose context .", 'convert text token , algorithm easili identifi pattern .', 'thi pattern recognit crucial make possibl machin understand respond human input .', "instanc , machin encount word `` run `` , n't see singular entiti rather combin token analyz deriv mean ."]


The problem with stemming is that it can produce intermediate forms, such as “imagin” or “tri” in the output above, that are not real words.

# Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be treated as a single item.
Lemmatization reduces a word to its lemma, its dictionary form. For example, the verb "walk" might appear as "walking," "walks," or "walked." The inflectional endings "s," "ed," and "ing" are removed, and all three forms are grouped under the lemma "walk."
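
The grouping can be pictured as a lookup from inflected form to lemma. The sketch below is a toy: the `LEMMAS` table and the `toy_lemmatize` helper are hypothetical, whereas a real lemmatizer such as NLTK's `WordNetLemmatizer` consults the WordNet database.

```python
# Hypothetical lookup table for illustration; a real lemmatizer
# consults a lexical database rather than a hand-written dict.
LEMMAS = {
    "walking": "walk",
    "walks": "walk",
    "walked": "walk",
}

def toy_lemmatize(word, table=LEMMAS):
    """Return the known lemma, or the word unchanged if it is not listed."""
    return table.get(word.lower(), word)

forms = ["walking", "walks", "walked", "walk"]
lemmas = {form: toy_lemmatize(form) for form in forms}
print(lemmas)
```

Unlike suffix stripping, a lookup always returns a real word, at the cost of needing a vocabulary that covers the input.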

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# Same sample text as before; the sentences are re-created because the stemming loop overwrote them
text = """
    Imagine you're trying to teach a child to read. Instead of diving straight into complex paragraphs,
    you'd start by introducing them to individual letters, then syllables, and finally, whole words.
    In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable
    units for machines. The primary goal of tokenization is to represent text in a manner that's meaningful for
    machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns.
    This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input.
    For instance, when a machine encounters the word "running", it doesn't see it as a singular entity but rather
    as a combination of tokens that it can analyze and derive meaning from.
"""
sentences = sent_tokenize(text)

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
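# Lemmatization: tokenize each sentence, drop stopwords, and lemmatize what remains
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  # The stopword check is case-sensitive, so capitalized words such as "The" are kept.
  # lemmatize() treats each word as a noun by default, so verb forms such as "trying" are unchanged.
  words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)

print(sentences)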

["Imagine 're trying teach child read .", "Instead diving straight complex paragraph , 'd start introducing individual letter , syllable , finally , whole word .", 'In similar vein , tokenization break vast stretch text digestible understandable unit machine .', "The primary goal tokenization represent text manner 's meaningful machine without losing context .", 'By converting text token , algorithm easily identify pattern .', 'This pattern recognition crucial make possible machine understand respond human input .', "For instance , machine encounter word `` running '' , n't see singular entity rather combination token analyze derive meaning ."]
