# 1. What is the purpose of text preprocessing in NLP, and why is it essential before analysis?

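Text preprocessing is a critical step in NLP that involves refining and structuring raw text data to facilitate effective analysis, interpretation, and modeling. It is essential because raw text is noisy and inconsistent, and a model would otherwise treat superficial variations in case, punctuation, or inflection as distinct features. Text preprocessing techniques include: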

1) Lowercasing: converting text data into lower case.

2) Tokenization: splitting text into words, phrases, symbols, etc.

3) Punctuation mark removal: removing punctuation marks from text.

4) Stemming: reducing words to their root form.

5) Lemmatization: reducing words to their base form using grammatical rules.

6) Part-of-speech tagging: assigning grammatical categories to words.



# 2. Describe tokenization in NLP and explain its significance in text processing.

Tokenization is an essential part of natural language processing.

It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.

There are three primary approaches to tokenization: rule-based, dictionary-based, and statistical-based.
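As a rough illustration of the rule-based approach, the sketch below tokenizes at the word level and at the character level using only the standard library. The regular expression and the helper names `word_tokens` and `char_tokens` are assumptions made for this example; this is not how NLTK's tokenizers are implemented.

```python
import re

def word_tokens(text):
    # Rule: a token is either a run of word characters or a single
    # non-space, non-word character, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Character-level tokenization: every non-space character is a token.
    return [ch for ch in text if not ch.isspace()]

sample = "Tokens can be words, or even characters."
print(word_tokens(sample))
print(char_tokens("NLP is fun"))
```

A single rule like this is easy to reason about, but it will split contractions and abbreviations in ways a trained tokenizer might not.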

In [3]:
# Assumes the NLTK tokenizer data has already been downloaded with nltk.download().
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

text="""It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization."""
print('Original text:\n',text)
print()
# Split the text into sentences.
tokenised_sent=sent_tokenize(text)
print('After sentence tokenization:\n',tokenised_sent)
print()
print('Number of sentences:\n',len(tokenised_sent))
print('='*70)
print()

print('Original text:\n',text)
print()
# Split the text into word and punctuation tokens.
tokenised_word=word_tokenize(text)
print('After word tokenization:\n',tokenised_word)
print()
print('Number of words:\n',len(tokenised_word))

Original text:
 It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.

After sentence tokenization:
 ['It involves splitting a text into smaller pieces, known as tokens.', 'These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.']

Number of sentences:
 2

Original text:
 It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.

After word tokenization:
 ['It', 'involves', 'splitting', 'a', 'text', 'into', 'smaller', 'pieces', ',', 'known', 'as', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'phrases', 'or', 'even', 'characters', 'an

# 3. What are the differences between stemming and lemmatization in NLP? When would you choose one over the other?

Stemming is a faster process than lemmatization, but lemmatization has higher accuracy.

Stemming is a rule-based approach, whereas lemmatization is a canonical dictionary-based approach.

Stemming chops off the word irrespective of the context, whereas lemmatization is context-dependent.

Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance.

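Lemmatization relies on a lexical resource such as the WordNet corpus to produce a lemma, whereas stemming algorithms have no knowledge of what a word means in the language it belongs to.

Stemming is a reasonable choice when speed matters more than readable output, as in search indexing. Lemmatization is preferable when the output must consist of valid words or when accuracy matters more than speed.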
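The contrast between the two approaches can be sketched with a toy example. The suffix rules and the small lemma dictionary below are hypothetical and chosen only for illustration; real stemmers such as Porter's use many more rules, and real lemmatizers look words up in a full lexical resource.

```python
# Hypothetical suffix rules and lemma dictionary, for illustration only.
SUFFIXES = ("ing", "es", "ed", "s")
LEMMAS = {"are": "be", "known": "know", "studies": "study", "pieces": "piece"}

def toy_stem(word):
    # Rule-based: strip the first matching suffix, with no check that
    # the result is a real word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Dictionary-based: look the word up; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["studies", "known", "are", "pieces"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer is cheap but can produce non-words, while the lemmatizer returns valid words only for entries its dictionary happens to cover.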


In [5]:
from nltk.stem import PorterStemmer

ps=PorterStemmer()

# Stem every token produced by word_tokenize above.
stemmed_words=[ps.stem(w) for w in tokenised_word]

print('='*70)
print('Tokenized words - without stemming:\n\n\t',tokenised_word)
print('='*70)
print('\nTokenized words - after stemming are:\n\t',stemmed_words)

# Assumes the WordNet data has already been downloaded with nltk.download('wordnet').
from nltk.stem import WordNetLemmatizer

lemma=WordNetLemmatizer()

# pos='v' treats every token as a verb, so verb forms are reduced
# (e.g. 'are' -> 'be') while nouns such as 'tokens' are left unchanged.
lemma_words=[lemma.lemmatize(word,pos='v') for word in tokenised_word ]

print('='*70)
print('Lemmatized words:\n',lemma_words)

Tokenized words - without stemming:

	 ['It', 'involves', 'splitting', 'a', 'text', 'into', 'smaller', 'pieces', ',', 'known', 'as', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'phrases', 'or', 'even', 'characters', 'and', 'are', 'the', 'basis', 'for', 'any', 'NLP', 'task', 'such', 'as', 'sentiment', 'analysis', ',', 'machine', 'translation', 'and', 'text', 'summarization', '.']

Tokenized words - after stemming are:
	 ['it', 'involv', 'split', 'a', 'text', 'into', 'smaller', 'piec', ',', 'known', 'as', 'token', '.', 'these', 'token', 'can', 'be', 'word', ',', 'phrase', 'or', 'even', 'charact', 'and', 'are', 'the', 'basi', 'for', 'ani', 'nlp', 'task', 'such', 'as', 'sentiment', 'analysi', ',', 'machin', 'translat', 'and', 'text', 'summar', '.']
Lemmatized words:
 ['It', 'involve', 'split', 'a', 'text', 'into', 'smaller', 'piece', ',', 'know', 'as', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'word', ',', 'phrase', 'or', 'even', 'character', 'and', 'be', 'the', 'basis

# 4. Explain the concept of stop words and their role in text preprocessing. How do they impact NLP tasks?

Stop words are words that are generally filtered out before processing natural language text. They are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.

Stop words are abundant in any human language. Removing them discards low-information tokens so that the analysis can focus on the words that carry meaning. The effect depends on the task: removal usually helps bag-of-words models, but it can hurt tasks where function words matter, for example when dropping "not" changes the sentiment of a sentence.
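A minimal sketch of stop word filtering follows. The stop word set here is a small hand-picked assumption for the example (real lists, such as NLTK's English list, are much longer), and `remove_stop_words` is a hypothetical helper name. The comparison is done in lowercase, since a case-sensitive membership test would keep capitalized stop words such as "It".

```python
# A tiny hand-picked stop-word set, assumed for this example only.
STOP_WORDS = {"it", "a", "into", "as", "these", "can", "be",
              "or", "and", "are", "the", "for"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "It" and "These" are filtered too.
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["It", "involves", "splitting", "a", "text", "into", "smaller", "pieces"]
print(remove_stop_words(tokens))
```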


In [12]:
# Assumes the stop-word lists have already been downloaded with nltk.download('stopwords').
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

# Note: the stop-word list is lowercase and this membership test is case-sensitive,
# so capitalized tokens such as 'It' and 'These' are kept.
filtered_tokens=[w for w in tokenised_word if w not in stop_words]
print('Length of words:\t',len(tokenised_word))
print('='*70)
print('Tokenized words - with stop words:\n\n\t',tokenised_word)
print('='*70)
print('Length after the removal of stopwords:\t',len(filtered_tokens))
print('='*70)
print('\nTokenized words - after removing the stopwords are:\n\t',filtered_tokens)

Length of words:	 42
Tokenized words - with stop words:

	 ['It', 'involves', 'splitting', 'a', 'text', 'into', 'smaller', 'pieces', ',', 'known', 'as', 'tokens', '.', 'These', 'tokens', 'can', 'be', 'words', ',', 'phrases', 'or', 'even', 'characters', 'and', 'are', 'the', 'basis', 'for', 'any', 'NLP', 'task', 'such', 'as', 'sentiment', 'analysis', ',', 'machine', 'translation', 'and', 'text', 'summarization', '.']
Length after the removal of stopwords:	 28

Tokenized words - after removing the stopwords are:
	 ['It', 'involves', 'splitting', 'text', 'smaller', 'pieces', ',', 'known', 'tokens', '.', 'These', 'tokens', 'words', ',', 'phrases', 'even', 'characters', 'basis', 'NLP', 'task', 'sentiment', 'analysis', ',', 'machine', 'translation', 'text', 'summarization', '.']


# 5. How does the process of removing punctuation contribute to text preprocessing in NLP? What are its benefits?

Text cleaning, or text preprocessing, is a mandatory step when working with text in Natural Language Processing (NLP). Real-world, human-written text contains misspelled words, shortened words, special symbols, emojis, and so on. This kind of noisy text needs to be cleaned before it is fed to a machine learning model. Removing punctuation is one such cleaning step: it shrinks the vocabulary and drops tokens that carry little meaning for tasks such as word counting.

However, many word embedding models support punctuation and special symbols. In that case, punctuation should be retained, because such models can distinguish between "hurray" and "hurray!" and tend to work better with the punctuation left in.
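One common way to strip punctuation from a raw string, sketched here with the standard library only, is a translation table built from `string.punctuation`. The helper name `strip_punctuation` is an assumption for this example, and the table covers ASCII punctuation only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    return text.translate(PUNCT_TABLE)

# After stripping, "hurray" and "hurray!" become the same string,
# which is exactly the distinction some models would want to keep.
print(strip_punctuation("hurray!"))
print(strip_punctuation("Well, that's noisy..."))
```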

In [9]:
# Keep only purely alphabetic tokens. Note that isalpha() also drops tokens
# containing digits or hyphens, not just punctuation.
words=[word for word in tokenised_word if word.isalpha()]

print('Original text:\n',text)
print('='*70)

print('After removing punctuation:\n')
print(words)

Original text:
 It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.
After removing punctuation:

['It', 'involves', 'splitting', 'a', 'text', 'into', 'smaller', 'pieces', 'known', 'as', 'tokens', 'These', 'tokens', 'can', 'be', 'words', 'phrases', 'or', 'even', 'characters', 'and', 'are', 'the', 'basis', 'for', 'any', 'NLP', 'task', 'such', 'as', 'sentiment', 'analysis', 'machine', 'translation', 'and', 'text', 'summarization']


# 6. Discuss the importance of lowercase conversion in text preprocessing. Why is it a common step in NLP tasks?

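Lowercasing, or converting all characters in a text to lowercase, is a common step in text preprocessing for many natural language processing (NLP) tasks. Lowercase conversion is important for several reasons:

1) Uniform representation: "Text" and "text" are treated as the same token.

2) Consistent tokenization: later steps, such as stop word matching, see a single form of each word.

3) Normalization for analysis: word counts are not split across case variants, which keeps the vocabulary smaller.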
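The effect on word counts can be seen in a small sketch using `collections.Counter`. The token list is made up for this example.

```python
from collections import Counter

# Made-up tokens containing three case variants of the same word.
tokens = ["The", "model", "reads", "the", "text", "and", "THE", "labels"]

raw_counts = Counter(tokens)
lower_counts = Counter(t.lower() for t in tokens)

# Without lowercasing, "The", "the" and "THE" are three separate entries.
print(raw_counts["the"], lower_counts["the"])
print(len(raw_counts), len(lower_counts))
```

The trade-off is that case sometimes carries meaning (for example, proper nouns and acronyms), so lowercasing is a default rather than a rule.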





In [10]:
lower_words=[word.lower() for word in tokenised_word]
print('Original text:\n',text)
print('='*70)

print('after lowering the case:\n')
print(lower_words)

Original text:
 It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis,  machine translation and text summarization.
after lowering the case:

['it', 'involves', 'splitting', 'a', 'text', 'into', 'smaller', 'pieces', ',', 'known', 'as', 'tokens', '.', 'these', 'tokens', 'can', 'be', 'words', ',', 'phrases', 'or', 'even', 'characters', 'and', 'are', 'the', 'basis', 'for', 'any', 'nlp', 'task', 'such', 'as', 'sentiment', 'analysis', ',', 'machine', 'translation', 'and', 'text', 'summarization', '.']


# 7. Explain the term "vectorization" concerning text data. How do techniques like CountVectorizer contribute to text preprocessing in NLP?

Vectorization in the context of text data refers to the process of converting textual information into numerical vectors that can be used as input for machine learning models.

CountVectorizer converts a collection of text documents into a matrix of token counts. Each row of the matrix corresponds to a document, and each column corresponds to a unique word (or token) in the entire collection. The values in the matrix represent the frequency of each word in the respective documents.

The output of CountVectorizer is typically a sparse matrix, where most of the entries are zero because not all words occur in every document. This sparse representation is memory-efficient, especially when dealing with large text corpora.
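To make the row and column layout concrete, here is a minimal pure-Python sketch of a token count matrix. It is not scikit-learn's implementation: the helper `count_matrix` is hypothetical, it tokenizes on whitespace, and it returns a dense list of lists rather than a sparse matrix.

```python
from collections import Counter

def count_matrix(documents):
    # Whitespace tokenization on lowercased text; an assumption for brevity.
    tokenized = [doc.lower().split() for doc in documents]
    # One column per unique token, sorted alphabetically.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    # One row per document, one count per vocabulary entry.
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocabulary])
    return vocabulary, rows

docs = ["the cat sat", "the cat saw the dog"]
vocab, matrix = count_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each inner list is one document, and the zeros mark vocabulary words that the document does not contain, which is why a sparse format pays off on large corpora.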



In [13]:
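from sklearn.feature_extraction.text import CountVectorizer

vector=CountVectorizer()

# Note: fit_transform expects an iterable of documents. Passing the token list
# treats each token as its own document, so the matrix has one row per token.
x=vector.fit_transform(tokenised_word)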

feature_names=vector.get_feature_names_out()

print('Feature names:\n',feature_names)
print('*'*70)
print('Token counts matrix:')
print(x.toarray())

Feature names:
 ['analysis' 'and' 'any' 'are' 'as' 'basis' 'be' 'can' 'characters' 'even'
 'for' 'into' 'involves' 'it' 'known' 'machine' 'nlp' 'or' 'phrases'
 'pieces' 'sentiment' 'smaller' 'splitting' 'such' 'summarization' 'task'
 'text' 'the' 'these' 'tokens' 'translation' 'words']
**********************************************************************
Token counts matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


# 8. Describe the concept of normalization in NLP. Provide examples of normalization techniques used in text preprocessing.

Normalization is a crucial step in Natural Language Processing (NLP) that involves cleaning and preprocessing text data to make it consistent and usable for different NLP tasks.

Normalization techniques used in text preprocessing include:

1) Case Normalization

2) Punctuation Removal

3) Stop Word Removal

4) Stemming

5) Lemmatization
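The first three techniques can be chained into a single function, as in the sketch below. The function name `normalize` and the small stop word set are assumptions for this example; stemming and lemmatization are left out to keep it free of external dependencies.

```python
import string

# Hypothetical stop-word set, kept small for the example.
STOP_WORDS = {"the", "a", "is", "and", "of"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    text = text.lower()                  # case normalization
    text = text.translate(PUNCT_TABLE)   # punctuation removal
    tokens = text.split()                # whitespace tokenization
    # Stop word removal on the already-lowercased tokens.
    return [t for t in tokens if t not in STOP_WORDS]

print(normalize("The Basis of NLP is Tokenization, and Normalization!"))
```

The order matters here: lowercasing first means the stop word check does not need to handle case variants.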


