## **Text Preprocessing**

**Text preprocessing** involves several sequential steps that clean and normalize textual data. Raw text often contains inconsistencies, noise, and irrelevant information that can negatively impact model performance. Through systematic preprocessing, we can extract meaningful features and improve the quality of our text analysis.

In [10]:
%pip install pandas 
%pip install spacy

Note: you may need to restart the kernel to use updated packages.


## **Section 0: Creating Data Sets**

In [11]:

# Import Pandas library

import pandas as pd


**Purpose:** Import the essential pandas library for data manipulation and analysis.



In [12]:

data = [

"When life gives you lemons, make lemonade! 🙂",

"She bought 2 lemons for $1 at Maven Market.",

"A dozen lemons will make a gallon of lemonade. [AllRecipes]",

"lemon, lemon, lemons, lemon, lemon, lemons",

"He's running to the market to get a lemon — there's a great sale today.",

"Does Maven Market carry Eureka lemons or Meyer lemons?",

"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",

"iced tea is my favorite"

]


**Purpose:** Create a sample dataset with diverse text challenges including:

- Mixed case letters
- Punctuation marks
- Special characters and emojis
- Numbers and currency symbols
- Contractions and apostrophes
- Citations in brackets
- Repeated words

In [13]:

# Convert list to DataFrame

data_df = pd.DataFrame(data, columns=['sentence'])


**Purpose:** Transform the list into a pandas DataFrame for easier manipulation and analysis.



In [14]:

# Set display options to show full content

pd.set_option('display.max_colwidth', None)


**Purpose:** Configure pandas to display complete text content without truncation, essential for examining preprocessing results.



## **Section 1: Preprocessing**

### **Normalization**

**Text normalization** is the process of converting text to a standard, consistent format. The most common normalization technique is converting all text to lowercase, which ensures that words like "Apple" and "apple" are treated as the same token.



In [15]:

# Create a copy for spaCy processing

spacy_df = data_df.copy()



# Convert text to lowercase

spacy_df['clean_sentence'] = spacy_df['sentence'].str.lower()


In [16]:
spacy_df

Unnamed: 0,sentence,clean_sentence
0,"When life gives you lemons, make lemonade! 🙂","when life gives you lemons, make lemonade! 🙂"
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for $1 at maven market.
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade. [allrecipes]
3,"lemon, lemon, lemons, lemon, lemon, lemons","lemon, lemon, lemons, lemon, lemon, lemons"
4,He's running to the market to get a lemon — there's a great sale today.,he's running to the market to get a lemon — there's a great sale today.
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons?
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]","an arnold palmer is half lemonade, half iced tea. [wikipedia]"
7,iced tea is my favorite,iced tea is my favorite


**Purpose:**
- Create a working copy to preserve original data
- Convert all text to lowercase for consistency
- Store results in a new column called 'clean_sentence'

**Key Learning:** Normalization reduces vocabulary size and prevents case-sensitive duplicates from being treated as different words.
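
As a rough illustration of the vocabulary effect, the sketch below counts distinct tokens before and after lowercasing. The two sample strings are made up for the example, and splitting on whitespace is a simplification of real tokenization.

```python
# Toy sentences (assumed for illustration, not taken from the dataset above)
samples = ["Lemon trees need sun", "a lemon tree needs Sun"]

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}

raw_vocab = vocabulary(samples)
lower_vocab = vocabulary(text.lower() for text in samples)

# "Lemon"/"lemon" and "sun"/"Sun" collapse into one entry each after lowercasing
print(len(raw_vocab), len(lower_vocab))
```

The trade-off is that lowercasing also erases distinctions that can matter, such as proper nouns versus common nouns.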



### **Text Cleaning**

In [17]:

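# Remove the literal "[wikipedia]" citation; regex=False stops the brackets
# from being read as a regex character class on older pandas versions

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace('[wikipedia]', '', regex=False)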



# Advanced cleaning with regex

combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(combined, ' ', regex=True)

spacy_df['clean_sentence'] = spacy_df['clean_sentence'].str.replace(r'\s+', ' ', regex=True).str.strip()


In [18]:
spacy_df

Unnamed: 0,sentence,clean_sentence
0,"When life gives you lemons, make lemonade! 🙂",when life gives you lemons make lemonade
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for 1 at maven market
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade allrecipes
3,"lemon, lemon, lemons, lemon, lemon, lemons",lemon lemon lemons lemon lemon lemons
4,He's running to the market to get a lemon — there's a great sale today.,he s running to the market to get a lemon there s a great sale today
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",an arnold palmer is half lemonade half iced tea
7,iced tea is my favorite,iced tea is my favorite


**Purpose:**

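- Remove the `[wikipedia]` citation as a literal string (other bracketed citations such as `[allrecipes]` only lose their brackets in the regex step, so the word `allrecipes` remains)
- Use a regular expression to replace URLs, HTML tags, email addresses, @-mentions, hashtags, and any remaining non-alphanumeric characters with a space
- Normalize whitespace by replacing multiple spaces with single spaces
- Side effect to be aware of: apostrophes are replaced too, so "he's" becomes "he s"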
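
To see which part of the combined pattern handles what, here is a small sketch that applies the same alternation to a single made-up string using the standard `re` module. The sample text and the `example.com` addresses are assumptions for illustration only.

```python
import re

# Same alternation as in the cell above
combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'

# Made-up input containing a URL, an email address, HTML tags, a handle, and a hashtag
sample = "visit https://example.com or email me@example.com <b>now</b> @maven #lemons!"

cleaned = re.sub(combined, ' ', sample)          # replace every match with a space
cleaned = re.sub(r'\s+', ' ', cleaned).strip()   # collapse the leftover whitespace

print(cleaned)  # visit or email now
```

The order of the alternatives matters: the URL and email patterns come before the catch-all `[^A-Za-z0-9\s]`, otherwise the catch-all would only strip the punctuation and leave fragments such as `https` and `example` behind.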



## **Section 1.2: Advanced Text Processing with spaCy**

**spaCy** is an NLP library whose pretrained pipelines tokenize text, tag parts of speech, and lemmatize words using linguistic rules and statistical models, which gives more reliable tokenization and lemmatization than basic string methods.

In [19]:

import spacy



# Download and install English language model

!python -m spacy download en_core_web_sm



# Load the pre-trained pipeline

nlp = spacy.load('en_core_web_sm')



# Process a sample sentence

phrase = spacy_df.clean_sentence[0] # "when life gives you lemons make lemonade"

doc = nlp(phrase)




Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 75.3 kB/s 0:02:30
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### **1.2.1 Tokenization**

**Tokenization** splits text into individual units (tokens) such as words, punctuation marks, or numbers. Modern tokenizers handle complex cases like contractions, compound words, and special characters intelligently.



In [20]:

# Extract tokens as text strings

[token.text for token in doc]

# Output: ['when', 'life', 'gives', 'you', 'lemons', 'make', 'lemonade']



# Extract tokens as spaCy objects (with linguistic attributes)

[token for token in doc]

# Output: [when, life, gives, you, lemons, make, lemonade]


[when, life, gives, you, lemons, make, lemonade]

**Purpose:**

- Demonstrate two ways to access tokens
- Show how spaCy preserves linguistic information in token objects



### **1.2.2 Lemmatization**


**Lemmatization** reduces words to their base or root form (lemma) using linguistic knowledge. Unlike stemming, which simply removes suffixes, lemmatization considers the word's part of speech and meaning to find the correct root form.

Examples:

- "running" → "run"
- "better" → "good"
- "mice" → "mouse"

In [21]:

# Extract lemmatized forms

[token.lemma_ for token in doc]

# Output: ['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']


['when', 'life', 'give', 'you', 'lemon', 'make', 'lemonade']

**Purpose:**

- Convert words to their dictionary forms
- Reduce vocabulary size by grouping inflected forms
- Note how "gives" becomes "give" and "lemons" becomes "lemon"
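
To make the contrast with stemming concrete, here is a deliberately naive sketch. `naive_stem` and `LEMMA_TABLE` are toy constructs invented for this example; they are not how spaCy or any real stemmer is implemented, and they only illustrate the difference between suffix stripping and dictionary-style lookup.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix, with no linguistic knowledge."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma table (hypothetical; real lemmatizers combine much
# larger lookup tables and rules with part-of-speech information)
LEMMA_TABLE = {"running": "run", "better": "good", "mice": "mouse", "gives": "give"}

def toy_lemma(word):
    """Look the word up in the table, falling back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["running", "better", "mice", "gives"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", toy_lemma(word))
```

Suffix stripping produces non-words such as "runn" and "giv" and cannot relate "mice" to "mouse", whereas a lemmatizer maps each form to a dictionary entry.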



### **1.2.3 Stop Words Removal**

**Stop words** are common words that carry little semantic meaning and are often filtered out to focus on more meaningful content. Examples include "the", "and", "is", and "in".

In [22]:

# Count the English stop words that ship with spaCy

print(f"Total stop words: {len(nlp.Defaults.stop_words)}") # 326 stop words



# Remove stop words

[token for token in doc if not token.is_stop]

# Output: [life, gives, lemons, lemonade]



# Combine lemmatization and stop word removal

[token.lemma_ for token in doc if not token.is_stop]

# Output: ['life', 'give', 'lemon', 'lemonade']



# Convert back to sentence format

norm = [token.lemma_ for token in doc if not token.is_stop]

' '.join(norm) # Output: 'life give lemon lemonade'


Total stop words: 326


'life give lemon lemonade'

In [23]:
spacy_df

Unnamed: 0,sentence,clean_sentence
0,"When life gives you lemons, make lemonade! 🙂",when life gives you lemons make lemonade
1,She bought 2 lemons for $1 at Maven Market.,she bought 2 lemons for 1 at maven market
2,A dozen lemons will make a gallon of lemonade. [AllRecipes],a dozen lemons will make a gallon of lemonade allrecipes
3,"lemon, lemon, lemons, lemon, lemon, lemons",lemon lemon lemons lemon lemon lemons
4,He's running to the market to get a lemon — there's a great sale today.,he s running to the market to get a lemon there s a great sale today
5,Does Maven Market carry Eureka lemons or Meyer lemons?,does maven market carry eureka lemons or meyer lemons
6,"An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]",an arnold palmer is half lemonade half iced tea
7,iced tea is my favorite,iced tea is my favorite


**Purpose:**

- Show the size of spaCy's English stop word list (326 words)
- Demonstrate filtering out common, low-information words
- Combine multiple preprocessing steps for maximum effect
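
Stop word lists are generic, and some entries may matter for a particular task (here, "make" is removed even though it is the verb of the sentence). A minimal sketch of one way to handle this, assuming a hypothetical `keep` set of words to retain; it uses a blank English pipeline, whose stop word flags come from the language defaults, so no model download is needed.

```python
import spacy

# Blank pipeline: tokenizer plus language defaults (including the stop word list)
stop_nlp = spacy.blank("en")

keep = {"make"}  # hypothetical keep-list of stop words we still want
doc_sw = stop_nlp("when life gives you lemons make lemonade")

# Drop stop words unless they are explicitly on the keep-list
filtered = [t.text for t in doc_sw if not t.is_stop or t.text in keep]
print(filtered)
```

Filtering with a keep-list leaves spaCy's shared defaults untouched, which avoids surprising other code that relies on `token.is_stop`.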



## **Section 2: Creating Reusable Functions**

Creating modular, reusable functions is essential for maintainable code and consistent preprocessing across different datasets.



In [24]:
# Function for lemmatization and stop word removal

def token_lemmastopw(text):
    doc = nlp(text)
    output = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(output)

# Apply to entire dataset
spacy_df["clean_sentence"] = spacy_df["clean_sentence"].apply(token_lemmastopw)


**Purpose:**

- Encapsulate preprocessing logic in a reusable function
- Enable consistent processing across multiple texts
- Demonstrate functional programming approach to text processing



## **Section 3: Complete NLP Pipeline**

An **NLP pipeline** combines multiple preprocessing steps into a single, streamlined workflow. This approach ensures consistency and makes it easy to apply the same transformations to new data.



In [25]:
# Function to lowercase text, replace unwanted patterns, and tidy whitespace
def lower_replace(series):
    output = series.str.lower()
    combined = r'https?://\S+|www\.\S+|<.*?>|\S+@\S+\.\S+|@\w+|#\w+|[^A-Za-z0-9\s]'
    output = output.str.replace(combined, ' ', regex=True)
    # Collapse the runs of spaces left behind so spaCy does not emit whitespace tokens
    output = output.str.replace(r'\s+', ' ', regex=True).str.strip()
    return output


# Complete NLP pipeline
def nlp_pipeline(series):
    output = lower_replace(series)
    output = output.apply(token_lemmastopw)  # lemmatize and drop stop words (defined in Section 2)
    return output


# Apply complete pipeline
cleaned_text = nlp_pipeline(data_df["sentence"])

# Save processed data for future use
pd.to_pickle(cleaned_text, "preprocessed_text.pkl")


**Purpose:**

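- Combine all preprocessing steps into a single function (unlike the step-by-step cells above, this pipeline has no `[wikipedia]` removal step, so `wikipedia` survives as a token in the saved output)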
- Create a reproducible workflow
- Save processed data in pickle format for efficient loading
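
One way to keep such a pipeline easy to extend is to express it as an ordered list of small text-to-text steps. The sketch below is a standard-library illustration, not the notebook's implementation: `compose` and the three step functions are hypothetical helpers, and the bracket pattern assumes citations always look like `[Source]` with no nesting.

```python
import re
from functools import reduce

def strip_bracket_citations(text):
    # Assumption: citations look like [Source] and never contain nested brackets
    return re.sub(r'\[[^\]]*\]', ' ', text)

def drop_non_alphanumeric(text):
    # Expects lowercased input, so only a-z, digits, and whitespace are kept
    return re.sub(r'[^a-z0-9\s]', ' ', text)

def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def compose(*steps):
    """Chain single-argument functions, applying them left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

clean = compose(str.lower, strip_bracket_citations, drop_non_alphanumeric, collapse_whitespace)

print(clean("An Arnold Palmer is half lemonade, half iced tea. [Wikipedia]"))
# an arnold palmer is half lemonade half iced tea
```

Step order matters here: the citation step has to run before the non-alphanumeric step, because once the brackets are gone the citation can no longer be told apart from ordinary words.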

## **Section 4: Word Representation (Vectorization)**

**Vectorization** converts preprocessed text into numerical representations that machine learning algorithms can process. Text must be transformed into vectors (arrays of numbers) because algorithms cannot directly work with text strings.

### **4.1 Count Vectorization (Bag of Words)**


**Count Vectorization** creates a matrix where each row represents a document and each column represents a unique word in the corpus. Cell values indicate how many times each word appears in each document. This approach ignores word order but captures word frequency.

In [26]:

# Install scikit-learn (if it is not already available)
%pip install scikit-learn

import pandas as pd

# Load preprocessed data

series = pd.read_pickle('preprocessed_text.pkl')



from sklearn.feature_extraction.text import CountVectorizer



# Create Count Vectorizer

cv = CountVectorizer()

bow = cv.fit_transform(series)



# Convert to DataFrame for visualization

pd.DataFrame(bow.toarray(), columns=cv.get_feature_names_out())


Note: you may need to restart the kernel to use updated packages.


Unnamed: 0,allrecipe,arnold,buy,carry,dozen,eureka,favorite,gallon,give,great,...,life,market,maven,meyer,palmer,run,sale,tea,today,wikipedia
0,0,0,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
2,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,1,1,0,1,0
5,0,0,0,1,0,1,0,0,0,0,...,0,1,1,1,0,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,1
7,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


**Purpose:**

- Transform text into numerical matrix representation
- Each column represents a unique word (feature)
- Each cell contains the count of that word in that document
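
The same idea can be sketched without scikit-learn to show what the matrix contains. The three documents below are made up for the example, and the sketch splits on whitespace, whereas `CountVectorizer` by default also ignores single-character tokens and punctuation.

```python
from collections import Counter

# Toy, already-preprocessed documents (made up for the example)
docs = ["life give lemon lemonade", "lemon lemon lemon", "ice tea favorite"]

counts = [Counter(doc.split()) for doc in docs]   # per-document word counts
vocab = sorted(set().union(*counts))              # one column per unique word

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Word order is lost: "lemon lemon lemon" is represented only by the number 3 in the `lemon` column.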



#### **Advanced Count Vectorization**


In [27]:

# Count Vectorizer with filtering

cv1 = CountVectorizer(
    stop_words='english',  # Remove English stop words
    ngram_range=(1, 1),    # Use only single words (unigrams)
    min_df=2               # Keep words that appear in at least 2 documents
)



bow1 = cv1.fit_transform(series)

bow1_df = pd.DataFrame(bow1.toarray(), columns=cv1.get_feature_names_out())



# Calculate term frequencies

term_freq = bow1_df.sum()


**Purpose:**

- Apply additional filtering to reduce noise
- Keep only words that appear in at least 2 documents (`min_df=2`)
- Calculate overall term frequencies for analysis
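
`min_df` filters on document frequency (the number of documents containing a word), not on the total number of occurrences. A small standard-library sketch of that distinction, using made-up documents and a hypothetical `docs_df` name:

```python
from collections import Counter

# Toy documents (made up for the example)
docs_df = [
    "buy lemon maven market",
    "run market lemon great sale today",
    "lemon lemon lemon",
]

# Document frequency: count each word at most once per document
doc_freq = Counter(word for doc in docs_df for word in set(doc.split()))

min_df = 2
kept_vocab = sorted(word for word, df in doc_freq.items() if df >= min_df)

print(kept_vocab)  # ['lemon', 'market']
```

Repeating a word many times inside a single document would not help it pass the threshold; only the number of distinct documents counts.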

## **Section 5: TF-IDF (Term Frequency-Inverse Document Frequency)**


**TF-IDF** addresses a key limitation of simple count vectorization by considering both term frequency (how often a word appears in a document) and inverse document frequency (how rare the word is across the entire corpus).

**Formula:** 
**TF-IDF = TF × IDF**

**Components:**
**TF (Term Frequency):**
```
TF = Number of times word appears in document / Total words in document
```

**IDF (Inverse Document Frequency):**
```
IDF = log(Total documents / Documents containing the word)
```

**Key Insight:** TF-IDF gives higher weights to words that are frequent in a specific document but rare across the corpus, making them more distinctive and informative.
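
The formulas above can be worked through by hand on a tiny corpus. This is a minimal sketch of the textbook definition only; the three documents are made up, the natural logarithm is an assumption (the formula does not fix a base), and the helper names are hypothetical.

```python
import math

# Toy tokenized corpus (made up for the example)
docs_tfidf = [
    "lemon lemon lemonade".split(),
    "lemon market".split(),
    "ice tea".split(),
]

def tf(word, doc):
    """Share of the document's words that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, corpus):
    """log(total documents / documents containing the word); assumes the word occurs somewhere."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

print(round(tfidf("lemon", docs_tfidf[0], docs_tfidf), 4))     # 0.2703
print(round(tfidf("lemonade", docs_tfidf[0], docs_tfidf), 4))  # 0.3662
```

Although "lemon" appears twice in the first document and "lemonade" only once, "lemonade" gets the higher score because it occurs in no other document.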

In [28]:

from sklearn.feature_extraction.text import TfidfVectorizer



# Basic TF-IDF vectorization

tv = TfidfVectorizer()

tvidf = tv.fit_transform(series)

tvidf_df = pd.DataFrame(tvidf.toarray(), columns=tv.get_feature_names_out())



# TF-IDF with filtering

tv1 = TfidfVectorizer(min_df=2) # Words must appear in at least 2 documents

tvidf1 = tv1.fit_transform(series)

tvidf1_df = pd.DataFrame(tvidf1.toarray(), columns=tv1.get_feature_names_out())


**Purpose:**

- Calculate TF-IDF scores for better feature weighting
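- Because `TfidfVectorizer` L2-normalizes each row by default, scores fall between 0 and 1
- Values closer to 1 indicate words that dominate a document and are rare elsewhere
- Values closer to 0 indicate either common words or absent words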
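
The numbers produced by `TfidfVectorizer` will not match the textbook formula exactly. The sketch below reproduces the calculation by hand, assuming scikit-learn's documented defaults (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`): raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The corpus and helper names are made up for the example.

```python
import math

# Toy tokenized corpus (made up for the example)
corpus = [["lemon", "lemon", "lemonade"], ["lemon", "market"], ["ice", "tea"]]
n_docs = len(corpus)
vocab_sk = sorted({w for doc in corpus for w in doc})

def smoothed_idf(word):
    """Smoothed IDF, assuming the smooth_idf=True default."""
    df = sum(1 for doc in corpus if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    raw = [doc.count(w) * smoothed_idf(w) for w in vocab_sk]  # raw count * idf
    norm = math.sqrt(sum(v * v for v in raw))                  # L2 norm of the row
    return [v / norm for v in raw]

row0 = tfidf_row(corpus[0])
print({w: round(v, 3) for w, v in zip(vocab_sk, row0) if v > 0})
```

Every row ends up with unit length, which is why individual scores never exceed 1 and why summing a column across documents, as done later, gives a rough measure of overall weight rather than a count.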



### **N-gram Analysis**

In [29]:

# TF-IDF over unigrams and bigrams (bigrams are pairs of consecutive words)

tv2 = TfidfVectorizer(ngram_range=(1,2)) # Include both unigrams and bigrams

tvidf2 = tv2.fit_transform(series)

tvidf2_df = pd.DataFrame(tvidf2.toarray(), columns=tv2.get_feature_names_out())



# Rank features by their total TF-IDF weight across all documents

tvidf2_df.sum().sort_values(ascending=False)


lemon                 1.583310
lemon lemon           0.857624
market                0.767950
lemonade              0.743321
ice tea               0.625522
ice                   0.625522
tea                   0.625522
maven market          0.621858
maven                 0.621858
half                  0.505881
tea favorite          0.493436
favorite              0.493436
buy lemon             0.439482
buy                   0.439482
lemon maven           0.439482
life give             0.416207
life                  0.416207
give                  0.416207
give lemon            0.416207
lemon lemonade        0.416207
lemonade allrecipe    0.358685
lemon gallon          0.358685
allrecipe             0.358685
dozen                 0.358685
gallon                0.358685
dozen lemon           0.358685
gallon lemonade       0.358685
run                   0.319884
sale today            0.319884
market lemon          0.319884
sale                  0.319884
run market            0.319884
great   

**Purpose:**

- Capture phrase-level information with bigrams
- Examples: "arnold palmer", "buy lemon", "ice tea"
- Preserve some context that unigrams lose
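
For reference, the features generated by `ngram_range=(1,2)` can be sketched in a few lines of plain Python. The `ngrams` helper is hypothetical and works on a whitespace-split string; it only shows how the feature names are formed, not how scikit-learn weights them.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens_ng = "ice tea favorite".split()

# Unigrams followed by bigrams, mirroring ngram_range=(1, 2)
features = ngrams(tokens_ng, 1) + ngrams(tokens_ng, 2)
print(features)  # ['ice', 'tea', 'favorite', 'ice tea', 'tea favorite']
```

Adding bigrams roughly doubles the number of features for short documents, so it is often paired with `min_df` to keep the vocabulary manageable.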