In [18]:
!pip install nltk



In [19]:
import nltk

In [20]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

1. Purpose of Text Preprocessing in NLP:
Text preprocessing in Natural Language Processing (NLP) serves several crucial purposes:

Noise Reduction: Cleaning and handling noisy data, such as HTML tags, special characters, and irrelevant information.

Normalization: Ensuring consistent representation of text, like converting all letters to lowercase.

Tokenization: Breaking down text into smaller units (tokens), like words or subwords.

Stemming and Lemmatization: Reducing words to their base or root form for analysis.

Removing Stop Words: Eliminating common words that don't contribute much to the overall meaning.

Vectorization: Converting text into numerical representations suitable for machine learning models.
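
To see how these steps fit together, here is a minimal sketch of a pipeline that uses only the Python standard library. The sample string, the tiny stop-word set, and the function name preprocess are assumptions made for illustration; stemming and lemmatization are left out to keep it short, and a real project would more likely rely on NLTK and scikit-learn.

```python
import re
from collections import Counter

# Hypothetical stop-word list, far smaller than a real one
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "in"}

def preprocess(raw):
    """Return a bag-of-words Counter for one raw document."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)             # noise reduction: drop HTML tags
    lowered = no_tags.lower()                          # normalization: lowercase
    tokens = re.findall(r"[a-z0-9]+", lowered)         # tokenization: alphanumeric runs
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return Counter(kept)                               # vectorization: word counts

counts = preprocess("<p>The cat and the hat: a CAT in a hat!</p>")
print(counts)
```

After the tags, case differences, punctuation, and stop words are gone, only "cat" and "hat" remain, each counted twice.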

2. Tokenization in NLP:
Tokenization splits text into units called tokens, typically words, subwords, or punctuation marks. Almost every later step (stop-word removal, stemming, vectorization) operates on tokens, so the tokenizer's choices carry through the whole pipeline. In Python, NLTK's word_tokenize is a common choice; it relies on the punkt data downloaded above:

In [21]:
from nltk.tokenize import word_tokenize

text = "Tokenization and Feature extraction is an essential step in NLP."
tokens = word_tokenize(text)
print(tokens)


['Tokenization', 'and', 'Feature', 'extraction', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.']
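
For intuition about why the final period comes out as its own token, the sketch below compares a plain whitespace split with a simple regular expression that separates punctuation. The regex is an illustrative stand-in, not how word_tokenize works internally, and it will disagree with NLTK on harder cases such as contractions.

```python
import re

text = "Tokenization and Feature extraction is an essential step in NLP."

# Whitespace splitting leaves the period glued to the last word
whitespace_tokens = text.split()

# Runs of word characters, or any single character that is neither word nor space
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[-1])   # NLP.
print(regex_tokens[-2:])       # ['NLP', '.']
```

With whitespace splitting, "NLP." and "NLP" would be counted as two different words downstream, which is one reason a dedicated tokenizer is preferred.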


3. Stemming vs. Lemmatization:
Stemming: Reducing words to a root form by stripping suffixes with fixed rules. It is faster but less accurate, and the result is not always a real word (the Porter stemmer turns "studies" into "studi").

In [22]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stemmed_word = stemmer.stem(word)
print(stemmed_word)

run


In [23]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

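Lemmatization: Obtaining the base or dictionary form of a word (lemma). It's slower but more accurate than stemming because it looks words up in a vocabulary (here WordNet) instead of chopping suffixes. WordNetLemmatizer treats a word as a noun unless told otherwise, so the pos='v' argument below is what lets it map "running" to "run".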

In [24]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
word = "running"
lemmatized_word = lemmatizer.lemmatize(word, pos='v')
print(lemmatized_word)


run


Choose stemming when speed is crucial and a little loss of accuracy is acceptable. Opt for lemmatization when precision is more critical.
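
The trade-off can be made concrete with a toy comparison. The suffix rules and the three-word dictionary below are hypothetical and deliberately crude; they are not the Porter algorithm or WordNet, only a sketch of rule-based stripping versus dictionary lookup.

```python
def toy_stem(word):
    """Strip the first matching suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a lexical resource such as WordNet
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ("running", "studies", "better"):
    print(w, toy_stem(w), toy_lemmatize(w))
# running runn run
# studies studi study
# better better good
```

The rule-based version needs no data and runs in a few string operations, but it produces non-words and cannot relate "better" to "good"; the lookup version is only as good as its dictionary.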

4. Stop Words in Text Preprocessing:
Stop words are common words such as "the", "and", and "is" that are often removed during text preprocessing. They carry little meaning on their own and can add noise to analyses such as keyword counting. NLTK ships a stop-word list for English:

In [25]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_text = [word for word in tokens if word.lower() not in stop_words]
print(filtered_text)

['Tokenization', 'Feature', 'extraction', 'essential', 'step', 'NLP', '.']
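
One caveat worth checking before removing stop words: such lists often contain negations like "not", and dropping them can flip the meaning of a sentence in tasks such as sentiment analysis. The sketch below uses a hypothetical mini stop-word list (not NLTK's) and made-up variable names to show one way of keeping negations.

```python
# Hypothetical mini stop-word list for illustration only
demo_stop_words = {"the", "is", "a", "this", "not", "no"}
negations = {"not", "no"}

demo_tokens = ["this", "movie", "is", "not", "good"]

# Plain filtering drops the negation along with the other stop words
filtered_all = [t for t in demo_tokens if t not in demo_stop_words]

# Subtracting the negations from the stop-word set keeps them in the text
filtered_keep_neg = [t for t in demo_tokens if t not in (demo_stop_words - negations)]

print(filtered_all)       # ['movie', 'good']
print(filtered_keep_neg)  # ['movie', 'not', 'good']
```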


5. Removing Punctuation:
Removing punctuation keeps the focus on the words themselves. It reduces the dimensionality of the data, since forms such as "NLP" and "NLP." no longer count as different items, and it simplifies analysis.

In [27]:
import string

text = "Text with punctuation!"
clean_text = text.translate(str.maketrans("", "", string.punctuation))
print(clean_text)

Text with punctuation
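
Stripping every character in string.punctuation also removes apostrophes and hyphens, which may merge or distort words. The following sketch, with a made-up sample sentence and an assumed choice of characters to keep, shows one way to be more selective.

```python
import string

sample = "Don't remove state-of-the-art hyphens, please!"

# Remove every character in string.punctuation
strip_all = sample.translate(str.maketrans("", "", string.punctuation))

# Remove punctuation except apostrophes and hyphens
keep = "'-"
to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
strip_some = sample.translate(str.maketrans("", "", to_remove))

print(strip_all)   # Dont remove stateoftheart hyphens please
print(strip_some)  # Don't remove state-of-the-art hyphens please
```

Which characters to keep depends on the task; the pair chosen here is only an example.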


6. Lowercase Conversion:
Converting text to lowercase is a common step to ensure uniformity. It treats differently cased forms such as "Text" and "text" as the same word, which reduces vocabulary size.

In [28]:
text = "This is a Sample Text for NLP."
lowercase_text = text.lower()
print(lowercase_text)


this is a sample text for nlp.
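
For text that is not plain English, Python's str.casefold() is a more aggressive alternative to lower() intended for caseless matching. The German example word below is an assumption chosen for illustration. Keep in mind that either method discards distinctions that are sometimes meaningful, such as "US" versus "us".

```python
german_word = "Straße"

print(german_word.lower())     # straße
print(german_word.casefold())  # strasse

# casefold() makes the two spellings compare equal; lower() does not
same_with_lower = german_word.lower() == "STRASSE".lower()
same_with_casefold = german_word.casefold() == "STRASSE".casefold()

print(same_with_lower)     # False
print(same_with_casefold)  # True
```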


7. Vectorization in NLP:
Vectorization is the process of converting text data into numerical vectors. scikit-learn's CountVectorizer implements the bag-of-words approach: it builds a vocabulary from the corpus and represents each document as a vector of word counts, one column per vocabulary word.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a sample document.", "Another example document."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0 1 0 1 1 1]
 [1 1 1 0 0 0]]
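
The matrix is easier to read once the columns have names. The sketch below rebuilds it by hand with the standard library, assuming CountVectorizer's default settings: lowercasing, a token pattern that keeps only tokens of two or more word characters, and columns in sorted order. It is meant to explain the output, not to replace the library.

```python
import re

corpus = ["This is a sample document.", "Another example document."]

# Lowercase, then keep tokens of two or more word characters,
# which is why the one-letter word "a" gets no column
tokenized = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in corpus]

# Columns are the sorted vocabulary
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one count per vocabulary word
matrix = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)  # ['another', 'document', 'example', 'is', 'sample', 'this']
print(matrix)      # [[0, 1, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

In the notebook itself, vectorizer.get_feature_names_out() (available in recent scikit-learn releases) should list the same column names.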


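8. Normalization in NLP:
Normalization involves transforming text data to a standard format. Techniques include lowercasing, stemming, lemmatization, and removing stop words. The cell below combines lowercasing, stop-word removal, and lemmatization, reusing tokens, stop_words, and lemmatizer from the earlier cells.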

In [30]:
normalized_text = [lemmatizer.lemmatize(word.lower()) for word in tokens if word.lower() not in stop_words]
print(normalized_text)

['tokenization', 'feature', 'extraction', 'essential', 'step', 'nlp', '.']
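
Note that the period survives this step, because it is neither a stop word nor changed by lemmatization. A possible extra filter is sketched below with the standard library only; the token list is copied from the tokenizer output earlier in the notebook, while the mini stop-word set and the helper name is_punctuation are assumptions, and lemmatization is omitted for brevity.

```python
import string

# Token list copied from the tokenizer output earlier in the notebook
demo_tokens = ['Tokenization', 'and', 'Feature', 'extraction', 'is',
               'an', 'essential', 'step', 'in', 'NLP', '.']

# Hypothetical mini stop-word list covering just this sentence
demo_stop_words = {"and", "is", "an", "in"}

def is_punctuation(token):
    """True if every character of the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

cleaned = [
    t.lower()
    for t in demo_tokens
    if t.lower() not in demo_stop_words and not is_punctuation(t)
]
print(cleaned)
# ['tokenization', 'feature', 'extraction', 'essential', 'step', 'nlp']
```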
