Natural Language Processing (NLP) pipelines usually begin with a series of preprocessing steps that transform raw text into a form suitable for analysis or machine learning models. These steps improve data quality and give algorithms a cleaner, more consistent input. The key preprocessing steps are described below, each with a short explanation and example code.

# 1. Lowercasing
Convert all text to lowercase to ensure uniformity and avoid treating the same words in different cases as different tokens.

In [14]:
text = "Hello World! This is NLP."
text = text.lower()
print(text)

hello world! this is nlp.
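
For text that is not purely English, `str.casefold()` can be a more thorough alternative to `lower()`. The sketch below uses the German word "Straße" as an illustrative input (it is not part of the original example):

```python
# casefold() applies Unicode case folding, which maps some characters
# that lower() leaves unchanged (for example the German "ß").
word = "Straße"
lowered = word.lower()
folded = word.casefold()
print(lowered)  # straße
print(folded)   # strasse
```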


# 2. Tokenization
Split the text into individual words or tokens. This is a fundamental step in NLP.

In [15]:
import nltk

nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

text = "Hello World! This is NLP."
tokens = word_tokenize(text)
print(tokens)

['Hello', 'World', '!', 'This', 'is', 'NLP', '.']


# 3. Removing Punctuation
Punctuation marks are often unnecessary for analysis and can be removed.

In [16]:
import string

text = "Hello, World! This is NLP."
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)

Hello World This is NLP


# 4. Removing Stopwords
Stopwords are common words (e.g., "the", "is", "and") that do not contribute much to the meaning of the text. Removing them can reduce noise.

In [17]:
import nltk

# Download the 'stopwords' dataset
nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ["this", "is", "a", "sample", "sentence"]
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)  # Output: ['sample', 'sentence']

['sample', 'sentence']


# 5. Stemming
Stemming reduces words to their root form by removing suffixes. It may not always result in a valid word.

In [18]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runner", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)  # Output: ['run', 'runner', 'ran']

['run', 'runner', 'ran']


# 6. Lemmatization
Lemmatization reduces words to their base or dictionary form (lemma). Unlike stemming, it ensures the result is a valid word.

In [19]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmatized_words)  # Output: ['run', 'runner', 'run'] ("runner" is a noun, so pos='v' leaves it unchanged)

['run', 'runner', 'run']


# 7. Removing Numbers
Numbers may not be relevant for text analysis and can be removed.

In [20]:
import re

text = "There are 3 apples and 5 oranges."
text = re.sub(r'\d+', '', text)
print(text)  # Output: "There are  apples and  oranges."

There are  apples and  oranges.
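
Removing digits leaves doubled spaces behind, as the output above shows. The following sketch shows two possible refinements: collapsing the leftover whitespace, or replacing each number with a placeholder instead of deleting it. The `<NUM>` token is an arbitrary name chosen for this example.

```python
import re

text = "There are 3 apples and 5 oranges."

# Option 1: drop the digits, then collapse the doubled spaces left behind.
removed = re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()
print(removed)  # There are apples and oranges.

# Option 2: keep a placeholder so a model still sees that a number was present.
# "<NUM>" is an arbitrary token chosen for this example.
replaced = re.sub(r"\d+", "<NUM>", text)
print(replaced)  # There are <NUM> apples and <NUM> oranges.
```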


# 8. Removing Extra Spaces
Extra spaces can be removed to clean up the text.

In [21]:
text = "   This   is   a   sentence.   "
text = ' '.join(text.split())
print(text)  # Output: "This is a sentence."

This is a sentence.


# 9. Handling Contractions
Expand contractions (e.g., "can't" → "cannot") to standardize the text.

In [22]:
!pip install contractions
from contractions import fix

text = "I can't do this."
text = fix(text)
print(text)  # Output: "I cannot do this."

I cannot do this.


# 10. Removing Special Characters
Special characters (e.g., @, #, $) can be removed if they are not relevant.

In [23]:
import re

text = "This is a #sample text with @special characters!"
text = re.sub(r'[^\w\s]', '', text)
print(text)  # Output: "This is a sample text with special characters"

This is a sample text with special characters


# 11. Part-of-Speech (POS) Tagging
Assign parts of speech (e.g., noun, verb) to each word in the text.

In [24]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download the required resource
nltk.download('averaged_perceptron_tagger_eng')

tokens = word_tokenize("This is a sample sentence.")
pos_tags = pos_tag(tokens)
print(pos_tags)  # Output: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('sentence', 'NN'), ('.', '.')]

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('sentence', 'NN'), ('.', '.')]


# 12. Named Entity Recognition (NER)
Identify and classify named entities (e.g., names, dates, locations) in the text.

In [26]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Download the required resources
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
# Recent NLTK versions also need the 'maxent_ne_chunker_tab' resource
nltk.download('maxent_ne_chunker_tab')

tokens = word_tokenize("John works at Google in New York.")
pos_tags = pos_tag(tokens)
ner_tags = ne_chunk(pos_tags)
print(ner_tags)

(S
  (PERSON John/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  (GPE New/NNP York/NNP)
  ./.)


# 13. Vectorization
Convert text into numerical representations (e.g., Bag of Words, TF-IDF, Word Embeddings) for machine learning models.

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is a sample sentence.", "Another example sentence."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())  # Output: [[0 0 1 1 1 1], [1 1 0 0 1 0]]
print(vectorizer.get_feature_names_out())  # Output: ['another', 'example', 'is', 'sample', 'sentence', 'this']

[[0 0 1 1 1 1]
 [1 1 0 0 1 0]]
['another' 'example' 'is' 'sample' 'sentence' 'this']
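
The example above covers Bag of Words only. As a rough illustration of TF-IDF, here is a minimal pure-Python sketch that uses raw term counts and the plain `log(N / df)` inverse document frequency. The tokenizer and the helper name `tf_idf` are simplifications made up for this example; scikit-learn's `TfidfVectorizer` uses a smoothed formula and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

corpus = ["This is a sample sentence.", "Another example sentence."]

# Very simple tokenizer for illustration: lowercase, strip periods, split on spaces.
docs = [doc.lower().replace(".", "").split() for doc in corpus]

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(doc):
    """Return {term: tf * idf} using raw counts and idf = log(N / df)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}

weights = [tf_idf(doc) for doc in docs]
for w in weights:
    print(w)
```

Terms that occur in every document (here "sentence") get a weight of zero, while terms unique to one document get the highest weight.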


# 14. Handling Missing Data
If the dataset contains missing text, it can be filled or removed.

In [28]:
import pandas as pd

data = {"text": ["Hello", None, "World"]}
df = pd.DataFrame(data)
df["text"] = df["text"].fillna("My Dear")  # Fill missing values
print(df)

      text
0    Hello
1  My Dear
2    World


# 15. Normalization
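Normalize text by converting it to a standard form, for example by applying Unicode normalization. The example below decomposes accented characters and then drops the accents, which is lossy.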

In [29]:
import unicodedata

text = "Café"
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
print(text)  # Output: "Cafe"

Cafe
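
When accents should be kept, Unicode normalization alone is often enough to make visually identical strings compare equal. The sketch below assumes the goal is consistent comparison rather than accent stripping, and uses the NFC form:

```python
import unicodedata

composed = "Caf\u00e9"      # "é" as a single code point
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)  # False: different code point sequences

# NFC composes characters where possible, so both forms become identical.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True
print(len(decomposed), len(nfc_b))  # 5 4
```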


# 16. Spelling Correction
Correct spelling errors in the text to improve consistency.

In [30]:
from textblob import TextBlob

text = "I made a many mistakes in Artificial intellengence"
blob = TextBlob(text)
corrected_text = blob.correct()
print(corrected_text)

I made a many mistakes in Artificial intelligence


# 17. Handling Emojis and Emoticons
Convert emojis and emoticons to text or remove them, depending on the use case.

In [31]:
!pip install emoji

import emoji

text = "I love Python! 😊"
# Convert emojis to text
demojized = emoji.demojize(text)
print(demojized)  # Output: "I love Python! :smiling_face_with_smiling_eyes:"

# Remove emojis (applied to the original text, not the demojized one)
no_emoji = emoji.replace_emoji(text, replace="")
print(no_emoji)  # Output: "I love Python! "

I love Python! :smiling_face_with_smiling_eyes:
I love Python! 


# 18. Removing HTML Tags
If the text contains HTML tags, they can be removed.

In [32]:
from bs4 import BeautifulSoup

text = "<p>This is a <b>sample</b> text.</p>"
soup = BeautifulSoup(text, "html.parser")
clean_text = soup.get_text()
print(clean_text)  # Output: "This is a sample text."

This is a sample text.


# 19. Handling URLs
Remove or replace URLs in the text.

In [33]:
import re

text = "Visit my website at https://example.com."
text = re.sub(r'https?://\S+|www\.\S+', '', text)
print(text)  # Output: "Visit my website at " (the trailing period is matched by \S+ and removed too)

Visit my website at 
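
The example above removes the URL together with the sentence's final period. A possible variant, sketched below, replaces each URL with a placeholder and keeps trailing punctuation. The helper name `replace_urls` and the `<URL>` token are assumptions made for this illustration, and the pattern is deliberately simple rather than a full URL grammar.

```python
import re

# Match http(s):// or www. prefixes up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def replace_urls(text, placeholder="<URL>"):
    """Replace each URL with a placeholder, keeping trailing sentence punctuation."""
    def _sub(match):
        url = match.group(0)
        stripped = url.rstrip(".,;:!?)")
        # Re-attach whatever punctuation was trimmed from the end of the match.
        return placeholder + url[len(stripped):]
    return URL_PATTERN.sub(_sub, text)

result = replace_urls("Visit my website at https://example.com.")
print(result)  # Visit my website at <URL>.
```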


# 20. Handling Mentions and Hashtags
Remove or process mentions (e.g., @username) and hashtags (e.g., #NLP) in social media text.



In [34]:
text = "Hey @user, check out #NLP!"
text = re.sub(r'@\w+|#\w+', '', text)
print(text)  # Output: "Hey , check out !"

Hey , check out !
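
Deleting mentions and hashtags outright can leave stray punctuation and discard useful topic words. As one way of processing them instead, the sketch below replaces mentions with a generic token and keeps the hashtag's word. The `<USER>` token is an arbitrary choice for this example.

```python
import re

text = "Hey @user, check out #NLP!"

# Replace mentions with a generic token ("<USER>" is an arbitrary choice here).
processed = re.sub(r"@\w+", "<USER>", text)
# Keep the hashtag's word but drop the leading "#".
processed = re.sub(r"#(\w+)", r"\1", processed)
print(processed)  # Hey <USER>, check out NLP!
```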


# 21. Sentence Segmentation
Split a paragraph into individual sentences.

In [35]:
from nltk.tokenize import sent_tokenize

text = "This is the first sentence. This is the second sentence."
sentences = sent_tokenize(text)
print(sentences)  # Output: ['This is the first sentence.', 'This is the second sentence.']

['This is the first sentence.', 'This is the second sentence.']


# 22. Handling Abbreviations
Expand abbreviations to their full forms for better understanding.

In [36]:
!pip install contractions

import contractions

text = "I'll be there ASAP."
expanded_text = contractions.fix(text)
print(expanded_text)  # Output: "I will be there AS SOON AS POSSIBLE."


I will be there AS SOON AS POSSIBLE.


# 23. Language Detection
Identify the language of the text so that text in irrelevant languages can be filtered out.



In [37]:
!pip install langdetect

from langdetect import detect

text = "Ceci est un texte en français."
language = detect(text)
print(language)  # Output: "fr"

fr


# 24. Text Encoding
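Ensure the text is in a consistent encoding format (e.g., UTF-8). In Python this matters mainly when reading or writing bytes: a `str` is already Unicode, so the round trip below leaves it unchanged.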

In [38]:
text = "Café"
text = text.encode('utf-8').decode('utf-8')
print(text)  # Output: "Café"

Café
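
Encoding problems usually appear when raw bytes arrive in an encoding other than the expected one. The sketch below assumes, purely for illustration, that the bytes were produced as Latin-1 and shows two ways to cope when UTF-8 decoding fails:

```python
raw = "Café".encode("latin-1")  # b'Caf\xe9', which is not valid UTF-8

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    # Fall back to the encoding we assume the source used (Latin-1 in this example).
    text = raw.decode("latin-1")
print(text)  # Café

# Alternatively, keep going and mark undecodable bytes with U+FFFD.
lossy = raw.decode("utf-8", errors="replace")
print(lossy)  # "Caf" followed by the U+FFFD replacement character
```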


# 25. Handling Whitespace Tokens
Remove tokens that are purely whitespace.

In [39]:
tokens = ["This", " ", "is", " ", "a", " ", "sample", " "]
tokens = [token for token in tokens if token.strip()]
print(tokens)  # Output: ['This', 'is', 'a', 'sample']

['This', 'is', 'a', 'sample']


# 26. Handling Dates and Times
Normalize or remove dates and times from the text.

In [40]:
import dateutil.parser as dparser

text = "The event is on 2023-10-15."
date = dparser.parse(text, fuzzy=True)
print(date)  # Output: 2023-10-15 00:00:00

2023-10-15 00:00:00
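
Parsing extracts the date but leaves the text unchanged. To actually normalize or remove dates in the text, a regular expression can be used. The sketch below handles ISO-style dates (YYYY-MM-DD) only; the `<DATE>` token and the target format are assumptions chosen for this example.

```python
import re
from datetime import datetime

text = "The event is on 2023-10-15."

# Matches ISO-style dates only (YYYY-MM-DD); other formats need their own patterns.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Removal / masking: swap the date for a placeholder token.
masked = ISO_DATE.sub("<DATE>", text)
print(masked)  # The event is on <DATE>.

# Normalization: rewrite every date in one chosen format (DD/MM/YYYY here).
def reformat_date(match):
    parsed = datetime.strptime(match.group(0), "%Y-%m-%d")
    return parsed.strftime("%d/%m/%Y")

normalized = ISO_DATE.sub(reformat_date, text)
print(normalized)  # The event is on 15/10/2023.
```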


# 27. Text Augmentation
Generate variations of the text to increase dataset size (useful for training).

In [41]:
!pip install nlpaug


In [42]:
from nlpaug.augmenter.word import SynonymAug

aug = SynonymAug(aug_src='wordnet')
text = "This is a sample text."
augmented_text = aug.augment(text)  # Returns a list of augmented strings
print(augmented_text)  # Output varies between runs, e.g. ['This is a sample schoolbook.']

['This is a sample schoolbook.']


# 28. Handling Negations
Detect and handle negations (e.g., "not good" → "not_good") to preserve meaning.

In [43]:
from nltk import word_tokenize

text = "This is not good."
tokens = word_tokenize(text)
for i, token in enumerate(tokens):
    if token == "not" and i + 1 < len(tokens):
        tokens[i + 1] = "not_" + tokens[i + 1]
print(tokens)  # Output: ['This', 'is', 'not', 'not_good', '.']

['This', 'is', 'not', 'not_good', '.']
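
The loop above marks only the single token after "not". A common refinement is to mark every token up to the end of the clause. The sketch below is one possible version using only the standard library; the negation word list, the clause-ending punctuation set, and the regex tokenizer are simplifying assumptions made for this example.

```python
import re

NEGATIONS = {"not", "no", "never"}
CLAUSE_END = {".", ",", ";", "!", "?"}

def mark_negation(text):
    """Prefix tokens following a negation word with 'not_' until the clause ends."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    marked, negating = [], False
    for token in tokens:
        if token in CLAUSE_END:
            negating = False          # punctuation closes the negation scope
            marked.append(token)
        elif token.lower() in NEGATIONS:
            negating = True           # start marking from the next token
            marked.append(token)
        elif negating:
            marked.append("not_" + token)
        else:
            marked.append(token)
    return marked

result = mark_negation("This is not very good, but it works.")
print(result)
# ['This', 'is', 'not', 'not_very', 'not_good', ',', 'but', 'it', 'works', '.']
```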


# 29. Dependency Parsing
Analyze the grammatical structure of a sentence.

In [44]:
import spacy

!python -m spacy download en_core_web_sm  # Download the model if not already installed
nlp = spacy.load("en_core_web_sm")

text = "This is a sample sentence."
doc = nlp(text)
for token in doc:
    print(token.text, token.dep_, token.head.text)

This nsubj is
is ROOT is
a det sentence
sample compound sentence
sentence attr is
. punct is


# 30. Handling Rare Words
Replace rare words with a special token (e.g., <UNK>) to reduce vocabulary size.

In [45]:
from collections import Counter

tokens = ["this", "is", "a", "rare", "word", "word"]
word_counts = Counter(tokens)
rare_words = {word for word, count in word_counts.items() if count < 2}
tokens = [token if token not in rare_words else "<UNK>" for token in tokens]
print(tokens)  # Output: ['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'word', 'word']

['<UNK>', '<UNK>', '<UNK>', '<UNK>', 'word', 'word']


# 31. Text Chunking
Group words into "chunks" based on their POS tags.



In [46]:
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

text = "This is a sample sentence."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)
print(tree)

(S This/DT is/VBZ (NP a/DT sample/JJ sentence/NN) ./.)


# 32. Handling Synonyms
Replace words with their synonyms to reduce redundancy.

In [47]:
from nltk.corpus import wordnet

word = "happy"
synonyms = wordnet.synsets(word)
print([syn.lemmas()[0].name() for syn in synonyms])  # Output: ['happy', 'felicitous', 'glad', 'happy']

['happy', 'felicitous', 'glad', 'happy']


# 33. Text Normalization for Social Media
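Normalize social media text by collapsing elongated words (e.g., "loooove" → "love"). Note that the pattern below collapses every run of repeated characters, so legitimate double letters (as in "good") are shortened too.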

In [48]:
import re

text = "I loooove this!"
text = re.sub(r'(.)\1+', r'\1', text)
print(text)  # Output: "I love this!"

I love this!
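
Collapsing every repeated character also damages ordinary words with double letters. A more conservative variant, sketched below, shortens only runs of three or more characters down to two. The helper name `squeeze_repeats` is made up for this example, and the result ("loove") may still need spelling correction afterwards.

```python
import re

def squeeze_repeats(text, max_repeat=2):
    """Shorten any run of the same character to at most max_repeat copies."""
    # (.)\1{2,} matches a character followed by two or more copies of itself.
    pattern = r"(.)\1{%d,}" % max_repeat
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, text)

print(squeeze_repeats("I loooove this!"))            # I loove this!
print(squeeze_repeats("The food is good"))           # The food is good
print(re.sub(r"(.)\1+", r"\1", "The food is good"))  # The fod is god
```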
