# Text Preprocessing Techniques with Examples

## 1. Lowercasing
**Definition:** Converting all characters in text to lowercase so that case variants of the same word (for example, "NLP" and "nlp") are treated as one token.

In [1]:
text = "I love NLP! It's amazing 😊. Check out https://example.com for more details. Call me at +1-800-555-0199."
text = text.lower()
print(text)

i love nlp! it's amazing 😊. check out https://example.com for more details. call me at +1-800-555-0199.


## 2. Tokenization
**Definition:** Splitting text into smaller components such as words (word tokenization) or sentences (sentence tokenization).

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this package

tokens = word_tokenize(text)
sentences = sent_tokenize(text)
print(tokens)
# Expected output (may vary slightly across NLTK versions):
# ['i', 'love', 'nlp', '!', 'it', "'s", 'amazing', '😊', '.', 'check', 'out', 'https', ':', '//example.com', 'for', 'more', 'details', '.', 'call', 'me', 'at', '+1-800-555-0199', '.']
print(sentences)
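
The output above shows `word_tokenize` breaking the URL into `https`, `:`, and `//example.com`. When URLs or phone numbers should stay intact, one option is a small regex tokenizer. The following is a minimal sketch using only the standard library; the pattern and the name `simple_tokenize` are illustrative assumptions, not part of NLTK.

```python
import re

# Order matters: earlier alternatives win, so URLs and phone-like strings
# are matched before the generic word and symbol rules.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+[^\s.,!?]    # URLs, excluding trailing sentence punctuation
    | \+?\d[\d-]*\d          # phone-like digit groups with optional leading plus
    | \w+(?:'\w+)?           # words, optionally with an internal apostrophe
    | [^\w\s]                # any other single non-space symbol
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Return tokens in order of appearance (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

sample = "check out https://example.com for more details. call me at +1-800-555-0199."
print(simple_tokenize(sample))
```

Unlike `word_tokenize`, this sketch keeps contractions such as "it's" as a single token, which may or may not suit the later steps.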


## 3. Stopword Removal
**Definition:** Removing very frequent function words (such as "the", "and", "is") that carry little meaning on their own.

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)
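
NLTK's English stopword list includes negations such as "not" and "no", so removing every stopword can flip the meaning of a sentence. The sketch below shows one way to keep selected words; the tiny `STOP_WORDS` set, the `NEGATIONS` set, and the `remove_stopwords` helper are illustrative assumptions rather than NLTK APIs.

```python
# Small hand-written stop list for illustration; a real list is much longer.
STOP_WORDS = {"the", "is", "a", "this", "and", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep=frozenset()):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

tokens = ["this", "movie", "is", "not", "good"]
print(remove_stopwords(tokens))                  # ['movie', 'good']
print(remove_stopwords(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```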

## 4. Removing Punctuation
**Definition:** Removing punctuation marks (like `.`, `!`, `,`) to clean up text data for processing.

In [None]:
import string

text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
print(text_no_punct)
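
The order of cleaning steps matters: stripping punctuation first fuses a URL into a single meaningless word, after which a URL pattern can no longer match it. The sketch below compares both orders on a shortened sample sentence. The helper names are hypothetical, and the URL pattern is the one used in the URL removal section.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)
URL_PATTERN = re.compile(r"http\S+|www\S+")

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(PUNCT_TABLE)

def strip_urls_then_punctuation(text):
    """Remove URLs first, then punctuation, then collapse leftover spaces."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(strip_punctuation(without_urls).split())

sample = "check out https://example.com for more details."
print(strip_punctuation(sample))            # check out httpsexamplecom for more details
print(strip_urls_then_punctuation(sample))  # check out for more details
```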

## 5. Removing Numbers
**Definition:** Eliminating digits to reduce noise in the text, especially when numbers are not relevant.

In [None]:
import re

text_no_numbers = re.sub(r'\d+', '', text)
print(text_no_numbers)
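
Removing only the digits leaves separators behind, so a phone number turns into a stray `+---`. One possible refinement is to remove phone-like strings as a unit before deleting the remaining digits. This is a sketch under assumptions: `PHONE_PATTERN` and `remove_numbers` are hypothetical names, and the pattern only covers simple hyphen-separated numbers.

```python
import re

# Assumed pattern: optional "+", then at least eight digits or hyphens,
# starting and ending with a digit.
PHONE_PATTERN = re.compile(r"\+?\d[\d-]{6,}\d")

def remove_numbers(text):
    """Drop phone-like strings first, then any remaining digits."""
    without_phones = PHONE_PATTERN.sub("", text)
    without_digits = re.sub(r"\d+", "", without_phones)
    return " ".join(without_digits.split())  # tidy leftover double spaces

sample = "call me at +1-800-555-0199 before 5 pm."
print(re.sub(r"\d+", "", sample))  # call me at +--- before  pm.
print(remove_numbers(sample))      # call me at before pm.
```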

## 6. Stemming
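**Definition:** Reducing words to a root form by heuristically stripping suffixes (e.g., "playing" becomes "play"). The resulting stem is not always a dictionary word (e.g., the Porter stemmer turns "studies" into "studi").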

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed_words)
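
To make the idea of suffix stripping concrete, here is a deliberately naive toy stemmer. It is not the Porter algorithm, which applies many more rules and conditions; the suffix list and the `naive_stem` function are assumptions made for illustration only.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # checked in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "running", "studies"]:
    print(w, "->", naive_stem(w))
# playing -> play
# played -> play
# plays -> play
# running -> runn
# studies -> studi
```

The last two results show why stems are often not valid words: the rule has no knowledge of doubled consonants or spelling changes.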

## 7. Lemmatization
**Definition:** Reducing words to their base or dictionary form, the lemma (e.g., "running" becomes "run" when treated as a verb). Unlike stemming, it relies on a vocabulary and the word's part of speech, so the result is a valid word.

In [None]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
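# lemmatize() treats every word as a noun by default; pass pos='v' for verbs,
# e.g. lemmatizer.lemmatize('running', pos='v') returns 'run'.
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]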
print(lemmatized_words)

## 8. Removing Extra Whitespaces
**Definition:** Collapsing runs of whitespace into a single space and trimming leading and trailing whitespace for text uniformity.

In [None]:
text_spaced = "I   love   NLP!  "  # separate variable so `text` keeps the original sample for later cells
text_cleaned = " ".join(text_spaced.split())
print(text_cleaned)

## 9. Removing URLs
**Definition:** Eliminating website links from text to clean up irrelevant information.

In [None]:
text_no_urls = re.sub(r'http\S+|www\S+', '', text)
print(text_no_urls)

## 10. Removing HTML Tags
**Definition:** Extracting only the visible text from HTML markup.

In [None]:
from bs4 import BeautifulSoup

html_text = "<p>Hello <b>World</b></p>"
clean_text = BeautifulSoup(html_text, "html.parser").get_text()
print(clean_text)
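
When installing BeautifulSoup is not an option, the standard library's `html.parser` can do a similar job. The following is a minimal sketch of that alternative, not the BeautifulSoup approach itself; the `TextExtractor` class and `strip_html` helper are hypothetical names, and skipping `<script>` and `<style>` content is a design choice made here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(strip_html("<p>Hello <b>World</b></p>"))  # Hello World
```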

## 11. Spelling Correction
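**Definition:** Correcting misspelled words in text. TextBlob's `correct()` is based on Peter Norvig's spelling corrector, which picks the most frequent known word within a small edit distance and does not use the surrounding context.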

In [None]:
from textblob import TextBlob

text_with_typos = "I lov NLP!"
corrected_text = str(TextBlob(text_with_typos).correct())
print(corrected_text)
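
TextBlob's documentation credits Peter Norvig's spelling corrector as the basis of `correct()`. The core idea can be sketched in a few lines: generate every string one edit away from the input, keep those that are known words, and choose the most frequent. The word counts below are made up for illustration, and this toy version only considers a single edit.

```python
from collections import Counter

# Toy frequency table (invented numbers); a real corrector counts words in a large corpus.
WORD_COUNTS = Counter({"i": 50, "love": 20, "live": 10, "low": 8, "nlp": 5})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings that are one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the word itself if known, else the most frequent known neighbor."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word

print(correct("lov"))  # love
```

Both "love" and "low" are one edit away from "lov"; the sketch picks "love" only because its count is higher, which shows that the choice depends on word frequency rather than on the sentence.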

## 12. Removing Emojis
**Definition:** Filtering out emojis and non-ASCII characters from text.

In [None]:
text_no_emojis = re.sub(r'[^\x00-\x7F]+', '', text)  # drop every non-ASCII character
print(text_no_emojis)
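
Filtering out all non-ASCII characters also deletes accented letters and non-Latin scripts. If only emojis should go, a pattern that targets emoji code point ranges is one alternative. The ranges below are an illustrative, non-exhaustive selection, and `EMOJI_PATTERN` and `remove_emojis` are hypothetical names.

```python
import re

# Illustrative, non-exhaustive set of emoji-related code point ranges.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)

def remove_emojis(text):
    """Remove characters in the listed emoji ranges, keeping other Unicode text."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("café is amazing 😊!"))  # café is amazing !
```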

## 13. Removing Special Characters
**Definition:** Removing symbols or characters that don't contribute to the text's semantic meaning.

In [None]:
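# Note: this keeps only ASCII letters and whitespace, so digits, emojis, and accented letters are removed too.
text_clean = re.sub(r'[^a-zA-Z\s]', '', text)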
print(text_clean)

## 14. Part-of-Speech Tagging (POS)
**Definition:** Assigning a grammatical category (such as noun, verb, or adjective) to each token in a text.

In [None]:
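from nltk import pos_tag

# pos_tag needs the perceptron tagger model to be downloaded first.
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # package name used by newer NLTK releases
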
pos_tags = pos_tag(tokens)
print(pos_tags)

## 15. Handling Contractions
**Definition:** Expanding contractions (like "don't" to "do not") for clarity.

In [None]:
import contractions

text_with_contractions = "I'm happy. He doesn't like NLP."
expanded_text = contractions.fix(text_with_contractions)
print(expanded_text)
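
The same idea can be sketched without a third-party package by looking contractions up in a small dictionary. The mapping below is a tiny hand-written sample, and `expand_contractions` is a hypothetical helper. Ambiguous forms such as "it's" (which can mean "it is" or "it has") are resolved to one fixed expansion here.

```python
import re

# Tiny illustrative mapping; a real list would contain many more entries.
CONTRACTIONS = {
    "i'm": "I am",
    "doesn't": "does not",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    def replace(match):
        original = match.group(0)
        expanded = CONTRACTIONS[original.lower()]
        # Preserve a leading capital, e.g. "Don't" becomes "Do not".
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return CONTRACTION_PATTERN.sub(replace, text)

print(expand_contractions("I'm happy. He doesn't like NLP."))
# I am happy. He does not like NLP.
```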

## 16. Named Entity Recognition (NER)
**Definition:** Identifying named entities in text and classifying them into categories such as persons, locations, or organizations.

In [None]:
import spacy

# Assumes the model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

## 17. Removing Accented Characters
**Definition:** Normalizing accented characters to their closest plain ASCII equivalents (e.g., "é" becomes "e"). Characters with no ASCII equivalent are dropped.

In [None]:
import unicodedata

text_accented = "café naïve résumé"
text_normalized = unicodedata.normalize('NFKD', text_accented).encode('ascii', 'ignore').decode('utf-8')
print(text_normalized)
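
Encoding to ASCII with `'ignore'` silently drops any character that has no ASCII form, including letters such as "ß" and entire non-Latin scripts. A gentler variant, sketched below, removes only the combining accent marks after decomposition. The `strip_accents` name is a hypothetical helper.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks but keep every base character."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("café naïve résumé"))  # cafe naive resume
print(strip_accents("Straße 東京"))         # Straße 東京
```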