# 1. What is the purpose of text preprocessing in NLP, and why is it essential before analysis?

Text preprocessing cleans and standardizes raw text so that later analysis works on consistent input. Typical steps include removing noise (markup, stray symbols), tokenization, lowercasing, stemming or lemmatization, and stop-word removal. Raw text writes the same word in many surface forms, so without these steps a model sees a larger and sparser vocabulary. Preprocessing reduces dimensionality, usually improves model performance, and makes the resulting features easier to interpret.
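As a rough sketch of how these steps chain together, the snippet below uses only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are illustrative assumptions, not part of NLTK or any other library, and whitespace splitting stands in for a real tokenizer.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"this", "is", "a", "the", "and", "of"}

def preprocess(raw):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = raw.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("This is Day one NLP."))  # ['day', 'one', 'nlp']
```

The order of the steps matters: lowercasing comes first so that "This" matches the lowercase stop-word entry.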

# 2. Describe tokenization in NLP and explain its significance in text processing.

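Tokenization is the process of breaking text into individual units (tokens), such as words, subwords, or punctuation marks. It matters because nearly every later step (stop-word removal, stemming, vectorization) operates on tokens rather than on the raw string.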

In [25]:
from nltk.tokenize import word_tokenize
text = "This is Day one NLP."
print("="*125)
print("Original Text:\n\n\t",text)
tokens = word_tokenize(text)
print("="*125)
print("\nAfter Tokenization\n\n\t",tokens)

Original Text:

	 This is Day one NLP.

After Tokenization

	 ['This', 'is', 'Day', 'one', 'NLP', '.']
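To show roughly what a word tokenizer does, here is a minimal regex-based sketch. `simple_tokenize` is a hypothetical helper, not NLTK's algorithm; it gives the same tokens for this sentence but handles contractions and abbreviations differently from `word_tokenize`.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is Day one NLP."))
# ['This', 'is', 'Day', 'one', 'NLP', '.']
```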


# 3. What are the differences between stemming and lemmatization in NLP? When would you choose one over the other?

Differences:
    
    Stemming strips suffixes with heuristic rules, so the result may not be a real word (the Porter stemmer turns "studies" into "studi").
    Lemmatization uses a vocabulary and the word's part of speech to return the dictionary form (as a verb, "is" becomes "be").

    Use stemming for faster processing and in applications where a crude approximation of the root is acceptable, such as search indexing.
    Use lemmatization for applications that need the accurate dictionary form of words, at the cost of extra lookups.

In [26]:
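from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
t = "This is Day one NLP."
# Both tools expect a single word, so tokenize first and process each token.
words = word_tokenize(t)
stemmed_words = [stemmer.stem(w) for w in words]
# pos='v' asks the lemmatizer to treat every token as a verb.
lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]
print("="*125)
print("\nStemmed:\n\n\t", stemmed_words)
print("="*125)
print("\nLemmatized:\n\n\t", lemmatized_words)



Stemmed:

	 ['thi', 'is', 'day', 'one', 'nlp', '.']

Lemmatized:

	 ['This', 'be', 'Day', 'one', 'NLP', '.']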
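The contrast between the two approaches can also be sketched without any corpus downloads. This is a toy illustration only: `toy_stem`, `toy_lemmatize`, the `SUFFIXES` rules, and the `LEMMA_TABLE` entries are all hypothetical, and they implement neither the Porter algorithm nor WordNet.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # hypothetical rule list
LEMMA_TABLE = {"studies": "study", "was": "be", "better": "good"}  # hypothetical lookup

def toy_stem(word):
    """Strip the first matching suffix without checking the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["studies", "was", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# was -> was | be
# better -> better | good
```

Rule-based stripping is cheap but blind to irregular forms, while the lookup handles them only if the form is in its table.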


# 4. Explain the concept of stop words and their role in text preprocessing. How do they impact NLP tasks?

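        -> Stop words are very common words (e.g., "and", "the", "is") that are often removed during text preprocessing.
        
        -> They carry little topical meaning on their own, so removing them shrinks the vocabulary and keeps the focus on content words.
        
        -> The impact depends on the task: removal usually helps bag-of-words models for search or topic classification, but it can hurt tasks such as sentiment analysis, where a word like "not" changes the meaning.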

In [27]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_text = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print("="*125)
print("Original Text:\n\n\t",text)
print("="*125)
print("Filtered Text:\n\n\t",filtered_text)



Original Text:

	 This is Day one NLP.
Filtered Text:

	 ['Day', 'one', 'NLP', '.']
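Stop-word removal is not always harmless. The sketch below uses a hypothetical five-word stop list (`STOP_WORDS`) that contains "not", as NLTK's English list also does, and shows two sentences with opposite meanings collapsing to the same tokens.

```python
# Hypothetical mini stop list that, like NLTK's English list, contains "not".
STOP_WORDS = {"the", "is", "was", "not", "a"}

def remove_stop_words(sentence):
    # Lowercase, split on whitespace, and keep only non-stop words.
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The movie was good"))      # ['movie', 'good']
print(remove_stop_words("The movie was not good"))  # ['movie', 'good']
```

For sentiment-style tasks, one option is to keep negation words by subtracting them from the stop list before filtering.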


# 5. How does the process of removing punctuation contribute to text preprocessing in NLP? What are its benefits?

        Removing punctuation strips characters such as periods, commas, and quotes so that analysis focuses on the words themselves.
        
        Benefits: A smaller vocabulary and no duplicate entries for the same word, since "NLP" and "NLP." would otherwise be
                  counted as different tokens when text is split on whitespace.

In [28]:
import string
text_no_punct = text.translate(str.maketrans("", "", string.punctuation))
print("="*125)
print("Original Text:\n\n\t",text)
print("="*125)
print("After Punctuation:\n\n\t",text_no_punct)


Original Text:

	 This is Day one NLP.
After Punctuation:

	 This is Day one NLP
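One caveat is that `string.punctuation` also contains apostrophes and hyphens, so this approach changes contractions and hyphenated words as well. The `strip_punctuation` helper below is a small hypothetical wrapper around the same `translate` call.

```python
import string

def strip_punctuation(text):
    # Delete every ASCII punctuation character, including ' and -.
    return text.translate(str.maketrans("", "", string.punctuation))

print(strip_punctuation("It's a state-of-the-art model, isn't it?"))
# Its a stateoftheart model isnt it
```

If contractions or hyphenated terms matter for the task, tokenizing first and then dropping punctuation-only tokens may be a safer order.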


# 6. Discuss the importance of lowercase conversion in text preprocessing. Why is it a common step in NLP tasks?
        
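        Importance: Ensures a uniform representation of words, so "Day" and "day" map to the same token and the vocabulary
                    shrinks. It is common because most bag-of-words models gain little from case. The trade-off is that
                    case can carry information ("US" vs. "us"), so it is sometimes skipped for named-entity recognition.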

In [29]:
text_lower = text.lower()
print("="*125)
print("Original Text:\n\n\t",text)
print("="*125)
print("After lower:\n\n\t",text_lower)


Original Text:

	 This is Day one NLP.
After lower:

	 this is day one nlp.


# 7. Explain the term "vectorization" concerning text data. How do techniques like CountVectorizer contribute to text preprocessing in NLP?

    Vectorization is the process of converting text data into numerical vectors.
    
    CountVectorizer builds a vocabulary from the corpus and converts each document into a row of token counts,
    one column per vocabulary term, which machine learning models can consume directly.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
ex = ["This is a sample text.", "This is Day one NLP."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(ex)
print("="*125)
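# Print the documents that were actually vectorized (ex), not the earlier `text` variable.
print("Original Text:\n\n\t",ex)
print("="*125)
print("After Countvectorizer:\n\n",X.toarray())


Original Text:

	 ['This is a sample text.', 'This is Day one NLP.']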
After Countvectorizer:

 [[0 1 0 0 1 1 1]
 [1 1 1 1 0 0 1]]
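To see where the columns of this matrix come from, the same counts can be rebuilt by hand. This is a standard-library sketch, not scikit-learn's implementation. It assumes the default behavior of `CountVectorizer` (lowercasing, and keeping tokens of two or more word characters), and the `tokenize` helper is hypothetical.

```python
import re
from collections import Counter

docs = ["This is a sample text.", "This is Day one NLP."]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters, mirroring
    # CountVectorizer's default token pattern; this is why "a" disappears.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Columns are the sorted vocabulary; each row counts one document's tokens.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
matrix = [[Counter(tokenize(doc))[term] for term in vocabulary] for doc in docs]

print(vocabulary)  # ['day', 'is', 'nlp', 'one', 'sample', 'text', 'this']
for row in matrix:
    print(row)
# [0, 1, 0, 0, 1, 1, 1]
# [1, 1, 1, 1, 0, 0, 1]
```

Each column corresponds to one vocabulary term in alphabetical order, so the first column counts "day" and the last counts "this".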


# 8. Describe the concept of normalization in NLP. Provide examples of normalization techniques used in text preprocessing.

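        Normalization involves transforming text to a standard form so that equivalent strings are represented identically.
        
        Examples: lowercasing, removing digits, collapsing extra whitespace, Unicode normalization, and expanding 
                  contractions ("don't" to "do not").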


In [31]:
text_normalized = text.lower()
# Iterating over a string yields characters, so this drops every digit character.
text_no_numbers = ''.join([ch for ch in text if not ch.isdigit()])
print("="*125)
print("Original Text:\n\n\t",text)
print("="*125)
print("After Normalization:\n\n\t",text_normalized)
print("="*125)
print("Text Numbers:\n\n\t",text_no_numbers)

Original Text:

	 This is Day one NLP.
After Normalization:

	 this is day one nlp.
Text Numbers:

	 This is Day one NLP.
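The sample sentence contains no digits, so digit removal leaves it unchanged. The sketch below combines several normalization steps on an input where each one has a visible effect. The `normalize` function and its choice of steps are assumptions for illustration; which steps are appropriate depends on the task.

```python
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKC", text)  # map compatibility characters to plain forms
    text = text.casefold()                      # aggressive, Unicode-aware lowercasing
    text = re.sub(r"\d+", "", text)             # remove digit runs
    text = re.sub(r"\s+", " ", text).strip()    # collapse and trim whitespace
    return text

# "\uff2e\uff2c\uff30" is "NLP" written with full-width Latin letters.
print(normalize("  Day 1 of  \uff2e\uff2c\uff30  "))  # day of nlp
```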
