# Text Preprocessing in Python

This notebook demonstrates text preprocessing in Python. It covers Python's built-in string methods, NLTK, and spaCy for common preprocessing tasks such as:

- Converting text to lowercase
- Removing punctuation
- Tokenization
- Lemmatization
- Stopword removal

In [2]:
# Install required packages
%pip install nltk spacy


In [4]:
# Example text
text = "This is an Example TEXT with Mixed CASE and Punctuations!!!"
print(text)

This is an Example TEXT with Mixed CASE and Punctuations!!!


## Basic Preprocessing with Python Built-in Methods

In [5]:
import string

# Convert to lowercase
lowercase_text = text.lower()
print("Lowercase:", lowercase_text)

# Remove punctuation (applied to the original text, so the case is preserved)
no_punctuation = ''.join(char for char in text if char not in string.punctuation)
print("Without Punctuation:", no_punctuation)

# Split into words on whitespace
words = no_punctuation.split()
print("Words:", words)

Lowercase: this is an example text with mixed case and punctuations!!!
Without Punctuation: This is an Example TEXT with Mixed CASE and Punctuations
Words: ['This', 'is', 'an', 'Example', 'TEXT', 'with', 'Mixed', 'CASE', 'and', 'Punctuations']
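
The three steps above can also be chained into a single helper. The following is a minimal sketch, not part of the original notebook: the name `basic_preprocess` is hypothetical, and it uses `str.translate` with a deletion table in place of the character-by-character join. Like the cell above, it only removes the ASCII characters in `string.punctuation`, so typographic quotes or dashes would pass through unchanged.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace.

    Hypothetical helper for illustration only.
    """
    cleaned = text.lower().translate(_PUNCT_TABLE)
    return cleaned.split()


print(basic_preprocess("This is an Example TEXT with Mixed CASE and Punctuations!!!"))
# ['this', 'is', 'an', 'example', 'text', 'with', 'mixed', 'case', 'and', 'punctuations']
```

Unlike the cell above, this sketch lowercases before removing punctuation, so the resulting words are already normalized for case.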


## Basic Preprocessing with NLTK

In [8]:
# Import NLTK and download required resources
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Without Stopwords:", filtered_tokens)

[nltk_data] Downloading package punkt to /home/salar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/salar/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /home/salar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Tokens: ['This', 'is', 'an', 'Example', 'TEXT', 'with', 'Mixed', 'CASE', 'and', 'Punctuations', '!', '!', '!']
Without Stopwords: ['Example', 'TEXT', 'Mixed', 'CASE', 'Punctuations', '!', '!', '!']
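
Note that the stopword filter keeps the `'!'` tokens and the original casing. The sketch below shows one possible way to drop punctuation-only tokens and lowercase the rest in the same pass. To stay self-contained it does not call NLTK: `tokens` is copied from the `word_tokenize` output above, `stop_words` is a tiny hand-written stand-in for `set(stopwords.words('english'))`, and `clean_tokens` is a hypothetical helper name.

```python
import string

# Token list copied from the word_tokenize output above
tokens = ['This', 'is', 'an', 'Example', 'TEXT', 'with', 'Mixed', 'CASE',
          'and', 'Punctuations', '!', '!', '!']

# Assumption: a small stand-in for set(stopwords.words('english'))
stop_words = {'this', 'is', 'an', 'with', 'and'}


def clean_tokens(tokens, stop_words):
    """Lowercase tokens and drop stopwords and punctuation-only tokens."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words:
            continue
        # Skip tokens made up entirely of ASCII punctuation, such as '!' or '...'
        if all(ch in string.punctuation for ch in lowered):
            continue
        cleaned.append(lowered)
    return cleaned


print(clean_tokens(tokens, stop_words))
# ['example', 'text', 'mixed', 'case', 'punctuations']
```

In the notebook itself, the same function could be called with the real `tokens` and `stop_words` variables from the NLTK cell.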


## Basic Preprocessing with spaCy

In [7]:
# Load spaCy
import spacy
from spacy.cli import download

# Download the small English model only if it is not already installed
if not spacy.util.is_package("en_core_web_sm"):
    download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

# Process the text
doc = nlp(text)

# Lemmatization
lemmatized = [token.lemma_ for token in doc]
print("Lemmatized:", lemmatized)

# Remove stopwords and punctuation
filtered = [token.text for token in doc if not token.is_stop and not token.is_punct]
print("Filtered:", filtered)

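TypeError: ForwardRef._evaluate() missing 1 required keyword-only argument: 'recursive_guard'

Note: this error is most likely raised while importing spaCy rather than by the preprocessing code itself. It typically points to an older pydantic release running on Python 3.12.4 or later. Upgrading pydantic and spaCy (for example with `%pip install -U pydantic spacy`) and restarting the kernel usually resolves it.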