# **Text Preprocessing**

## **Tokenization**
In NLP, tokenization is the process of breaking text down into smaller units called tokens, such as sentences, words, or subwords. It is usually the first preprocessing step, because later steps such as stemming and stop-word removal operate on tokens. (The same term is also used in data security, where it means replacing sensitive data such as credit card numbers with non-sensitive placeholders; that meaning is not covered here.)

### Terminologies
- **Corpus**: The whole body of text (here, a paragraph)
- **Document**: A single unit of the corpus (here, a sentence)
- **Vocabulary**: All the *unique* words
- **Words**: All the words, counting repeats (total)
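
The four terms can be made concrete with a few lines of plain Python. This is only a sketch: the short `corpus` string and the naive regex rules are assumptions made for illustration, not the NLTK tokenizers used in the rest of this notebook.

```python
import re

# A shortened, made-up corpus for illustration.
corpus = "Pizza is loved worldwide. Pizza originated in Naples. Today, pizza is a global phenomenon."

# Documents: split after sentence-ending punctuation followed by whitespace (a naive rule).
documents = re.split(r"(?<=[.!?])\s+", corpus.strip())

# Words: every alphabetic token, lowercased so "Pizza" and "pizza" count as the same word.
words = re.findall(r"[a-z]+", corpus.lower())

# Vocabulary: the set of unique words.
vocabulary = set(words)

print(len(documents))   # 3 documents
print(len(words))       # 14 words in total
print(len(vocabulary))  # 11 unique words
```

The vocabulary is smaller than the word count because "pizza" and "is" occur more than once.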

### Implementation

In [64]:
corpus = """
Pizza is one of the world’s most beloved foods, known for its comforting flavors and endless variety. From simple street slices to gourmet creations, it brings people together across cultures and continents.
Pizza originated in Naples, where flatbreads topped with tomatoes and cheese became popular among working-class families in the 18th and 19th centuries. The famous Margherita pizza, made with tomatoes, mozzarella, and basil, was created in honor of Queen Margherita of Savoy. Italian immigrants later introduced pizza to the United States, especially in New York City.
Today, pizza is a global phenomenon, adapted to local tastes and enjoyed in countless creative forms.
"""
print(corpus)


Pizza is one of the world’s most beloved foods, known for its comforting flavors and endless variety. From simple street slices to gourmet creations, it brings people together across cultures and continents.
Pizza originated in Naples, where flatbreads topped with tomatoes and cheese became popular among working-class families in the 18th and 19th centuries. The famous Margherita pizza, made with tomatoes, mozzarella, and basil, was created in honor of Queen Margherita of Savoy. Italian immigrants later introduced pizza to the United States, especially in New York City.
Today, pizza is a global phenomenon, adapted to local tastes and enjoyed in countless creative forms.



In [65]:
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

# split the corpus into sentences (documents)
documents = sent_tokenize(corpus)
documents

['\nPizza is one of the world’s most beloved foods, known for its comforting flavors and endless variety.',
 'From simple street slices to gourmet creations, it brings people together across cultures and continents.',
 'Pizza originated in Naples, where flatbreads topped with tomatoes and cheese became popular among working-class families in the 18th and 19th centuries.',
 'The famous Margherita pizza, made with tomatoes, mozzarella, and basil, was created in honor of Queen Margherita of Savoy.',
 'Italian immigrants later introduced pizza to the United States, especially in New York City.',
 'Today, pizza is a global phenomenon, adapted to local tastes and enjoyed in countless creative forms.']

In [66]:
for doc in documents:
    print(word_tokenize(doc))

['Pizza', 'is', 'one', 'of', 'the', 'world', '’', 's', 'most', 'beloved', 'foods', ',', 'known', 'for', 'its', 'comforting', 'flavors', 'and', 'endless', 'variety', '.']
['From', 'simple', 'street', 'slices', 'to', 'gourmet', 'creations', ',', 'it', 'brings', 'people', 'together', 'across', 'cultures', 'and', 'continents', '.']
['Pizza', 'originated', 'in', 'Naples', ',', 'where', 'flatbreads', 'topped', 'with', 'tomatoes', 'and', 'cheese', 'became', 'popular', 'among', 'working-class', 'families', 'in', 'the', '18th', 'and', '19th', 'centuries', '.']
['The', 'famous', 'Margherita', 'pizza', ',', 'made', 'with', 'tomatoes', ',', 'mozzarella', ',', 'and', 'basil', ',', 'was', 'created', 'in', 'honor', 'of', 'Queen', 'Margherita', 'of', 'Savoy', '.']
['Italian', 'immigrants', 'later', 'introduced', 'pizza', 'to', 'the', 'United', 'States', ',', 'especially', 'in', 'New', 'York', 'City', '.']
['Today', ',', 'pizza', 'is', 'a', 'global', 'phenomenon', ',', 'adapted', 'to', 'local', 'tastes

In [67]:
# treats every punctuation mark as a separate token (e.g. 'working-class' becomes 'working', '-', 'class')

for doc in documents:
    print(wordpunct_tokenize(doc))

['Pizza', 'is', 'one', 'of', 'the', 'world', '’', 's', 'most', 'beloved', 'foods', ',', 'known', 'for', 'its', 'comforting', 'flavors', 'and', 'endless', 'variety', '.']
['From', 'simple', 'street', 'slices', 'to', 'gourmet', 'creations', ',', 'it', 'brings', 'people', 'together', 'across', 'cultures', 'and', 'continents', '.']
['Pizza', 'originated', 'in', 'Naples', ',', 'where', 'flatbreads', 'topped', 'with', 'tomatoes', 'and', 'cheese', 'became', 'popular', 'among', 'working', '-', 'class', 'families', 'in', 'the', '18th', 'and', '19th', 'centuries', '.']
['The', 'famous', 'Margherita', 'pizza', ',', 'made', 'with', 'tomatoes', ',', 'mozzarella', ',', 'and', 'basil', ',', 'was', 'created', 'in', 'honor', 'of', 'Queen', 'Margherita', 'of', 'Savoy', '.']
['Italian', 'immigrants', 'later', 'introduced', 'pizza', 'to', 'the', 'United', 'States', ',', 'especially', 'in', 'New', 'York', 'City', '.']
['Today', ',', 'pizza', 'is', 'a', 'global', 'phenomenon', ',', 'adapted', 'to', 'local',
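
To see why `wordpunct_tokenize` splits `working-class` into three tokens, it helps to write the rule as a single regular expression. The sketch below is an approximation using only the standard library; the function name `simple_wordpunct` is hypothetical, and the pattern is assumed to mirror NLTK's behaviour rather than taken from its source.

```python
import re

def simple_wordpunct(text: str) -> list:
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s]+ grabs runs of anything that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)

print(simple_wordpunct("working-class families in the 18th century."))
# ['working', '-', 'class', 'families', 'in', 'the', '18th', 'century', '.']
```

Because the hyphen is not a word character, it always becomes its own token, whereas `word_tokenize` keeps hyphenated words together.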

## **Stemming**
Stemming is a text preprocessing technique that reduces words to a base or root form by stripping affixes, mostly suffixes, using heuristic rules (e.g., "running" and "runs" both become "run"). The resulting stem is not guaranteed to be a dictionary word. It is used for normalization in search engines, chatbots, and text mining, where a smaller vocabulary improves efficiency.

### Implementation

In [68]:
words = ['running', 'runs', 'eating', 'walking', 'eaten', 'walks', 'goes', 'go', 'going', 'fairly', 'sportingly']

#### Porter Stemmer

In [69]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

for w in words:
    print(f"{w}: {porter.stem(w)}")

running: run
runs: run
eating: eat
walking: walk
eaten: eaten
walks: walk
goes: goe
go: go
going: go
fairly: fairli
sportingly: sportingli
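
The effect on vocabulary size can be checked by grouping the words by their stems. The sketch below does not call NLTK; it copies the stems from the Porter output above into a list, so it only illustrates the bookkeeping.

```python
words = ['running', 'runs', 'eating', 'walking', 'eaten', 'walks',
         'goes', 'go', 'going', 'fairly', 'sportingly']

# Stems copied from the Porter output above (not recomputed here).
porter_stems = ['run', 'run', 'eat', 'walk', 'eaten', 'walk',
                'goe', 'go', 'go', 'fairli', 'sportingli']

# Group the original words under their shared stem.
groups = {}
for word, stem in zip(words, porter_stems):
    groups.setdefault(stem, []).append(word)

print(len(set(words)), "distinct words ->", len(groups), "distinct stems")
# 11 distinct words -> 8 distinct stems
print(groups["go"])
# ['go', 'going']
```

Note that `goes` ends up under `goe` rather than `go`, which is one of the cases where the stemmer does not merge related forms.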


#### Snowball stemmer
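The Porter stemmer handles most of these words, but it produces non-words for *fairly* (`fairli`) and *sportingly* (`sportingli`) and leaves *eaten* unchanged. `SnowballStemmer` (also known as Porter2) is a revised version of the algorithm. It fixes *fairly* and *sportingly*, although *eaten* and *goes* are still not reduced to *eat* and *go*.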

In [70]:
from nltk.stem import SnowballStemmer

snow = SnowballStemmer(language='english')

for w in words:
    print(f"{w}: {snow.stem(w)}")

running: run
runs: run
eating: eat
walking: walk
eaten: eaten
walks: walk
goes: goe
go: go
going: go
fairly: fair
sportingly: sport


#### RegexpStemmer
It lets the programmer decide which suffixes to strip by supplying a regular expression. This gives more control, although the result is only as good as the pattern.

In [71]:
from nltk.stem import RegexpStemmer

# every alternative ends with $ so that it only matches a suffix at the end of the word
reg = RegexpStemmer(regexp="ing$|es$|ly$|s$|en$")

for w in words:
    print(f"{w}: {reg.stem(w)}")

running: runn
runs: run
eating: eat
walking: walk
eaten: eat
walks: walk
goes: go
go: go
going: go
fairly: fair
sportingly: sporting


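We can see that this also handles *eaten* correctly, because the regex includes "en$". The approach has limits, though: *running* becomes *runn* and *sportingly* becomes *sporting*, since only one suffix is stripped and doubled consonants are not handled.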
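
Regex-based suffix stripping can be reproduced with the standard `re` module, which also makes it easy to add a guard against over-stemming short words. This is a minimal sketch, not NLTK's implementation: the function name `strip_suffix` and the `min_stem` parameter are hypothetical, and the suffix list (ing, es, ly, s, en) is chosen to match the example words.

```python
import re

# One suffix from this list is removed, and only at the end of the word.
SUFFIXES = re.compile(r"(ing|es|ly|s|en)$")

def strip_suffix(word: str, min_stem: int = 2) -> str:
    """Strip one listed suffix unless the remaining stem would be shorter than min_stem."""
    stem = SUFFIXES.sub("", word)
    return stem if len(stem) >= min_stem else word

for w in ["eaten", "runs", "running", "ring"]:
    print(f"{w}: {strip_suffix(w)}")
# eaten: eat
# runs: run
# running: runn
# ring: ring
```

Without the length guard, `ring` would be reduced to `r`, so a minimum stem length is a cheap way to avoid the worst cases.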

## **Lemmatization**
Lemmatization in NLP is a text normalization technique that reduces inflected or derived words to their base dictionary form, known as a lemma (e.g., "running" to "run," "better" to "good"). Unlike stemming, it utilizes vocabulary, morphological analysis, and part-of-speech (POS) tagging to ensure accuracy. It is crucial for improving search relevance, chatbot context, and reducing data redundancy. 

### Implementation

In [72]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

In [73]:
# basic use case
lem.lemmatize("running")

'running'

> "running" came back as "running"? Why?

Because by default, `WordNetLemmatizer` treats every word as a noun. We have to set the `pos` parameter accordingly:

- noun -> n
- adjective -> a
- adverb -> r
- verb -> v

In [74]:
lem.lemmatize("running", pos='v')   # now output will come as 'run'

'run'

However, it is not practical to hardcode the POS tag for every word, so spaCy, which assigns POS tags automatically, is a better choice for this purpose.

In [75]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(" ".join(words))

for token in doc:
    print(f"{token.text} : {token.lemma_} ({token.pos_})")

running : run (VERB)
runs : run (NOUN)
eating : eat (VERB)
walking : walk (VERB)
eaten : eat (VERB)
walks : walk (NOUN)
goes : go (VERB)
go : go (VERB)
going : go (VERB)
fairly : fairly (ADV)
sportingly : sportingly (ADV)
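
If you prefer to stay with `WordNetLemmatizer`, a common workaround is to run a POS tagger first and translate its tags into the four `pos` letters listed earlier. The helper below is a sketch of that translation only. It assumes the tags follow the Penn Treebank convention (as produced by `nltk.pos_tag`), and the name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag (e.g. 'VBG') to a WordNet pos letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else fall back to the lemmatizer's default

for tag in ["VBG", "NNS", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
# VBG -> v
# NNS -> n
# JJ -> a
# RB -> r
```

With the tagger resources downloaded, this could be combined as `lem.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair returned by `nltk.pos_tag(tokens)`.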


## **Handling Stopwords**
Stop words are common, high-frequency words (e.g., "the," "is," "and," "in") that are often filtered out during text preprocessing to reduce noise and improve model efficiency. They usually carry little meaning on their own, so removing them helps algorithms focus on significant words for tasks like search indexing and topic modeling. Some stop words do matter for certain tasks, though: NLTK's English list includes "not", which can flip the meaning of a sentence in sentiment analysis.

In [76]:
from nltk.corpus import stopwords
stopwords.words('english')
# we need to filter out these words

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [77]:
sentence = "In the middle of the day, I was sitting on the porch and thinking about the things that I had not yet done."

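# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

stop = []
useful_words = []

# note: the comparison is case-sensitive and punctuation stays attached,
# so 'In', 'I' and 'day,' end up in useful_words
for word in sentence.split(" "):
    if word in stop_words:
        stop.append(word)
    else:
        useful_words.append(word)

print(f"Stop words: {stop}")
print(f"Useful words: {useful_words}")

Stop words: ['the', 'of', 'the', 'was', 'on', 'the', 'and', 'about', 'the', 'that', 'had', 'not']
Useful words: ['In', 'middle', 'day,', 'I', 'sitting', 'porch', 'thinking', 'things', 'I', 'yet', 'done.']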
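
In the output above, `In` and `I` are kept as useful words and `day,` still carries its comma, because the lookup is case-sensitive and works on raw whitespace-separated chunks. A possible refinement is sketched below. To stay self-contained it uses a tiny hand-picked `STOP_WORDS` set instead of NLTK's list, and the function name `split_stop_words` is hypothetical.

```python
import string

# A tiny stop list for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"in", "the", "of", "i", "was", "on", "and", "about", "that", "had", "not"}

def split_stop_words(sentence: str, stop_words: set) -> tuple:
    """Return (stop, useful) word lists, ignoring case and surrounding punctuation."""
    stop, useful = [], []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)  # 'day,' -> 'day', 'done.' -> 'done'
        if not word:
            continue  # the chunk was punctuation only
        if word.lower() in stop_words:
            stop.append(word)
        else:
            useful.append(word)
    return stop, useful

sentence = "In the middle of the day, I was sitting on the porch and thinking about the things that I had not yet done."
stop, useful = split_stop_words(sentence, STOP_WORDS)
print(useful)
# ['middle', 'day', 'sitting', 'porch', 'thinking', 'things', 'yet', 'done']
```

The same function works with NLTK's list by passing `set(stopwords.words('english'))` as `stop_words`.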


## **Creating a whole pipeline for text preprocessing**

In [78]:
class textPreprocessor:
    def __init__(self, corpus):
        self.corpus = corpus
        # load the stop-word set and the spaCy model once, not on every call
        self.stop_words = set(stopwords.words('english'))
        self.nlp = spacy.load("en_core_web_sm")

    def process(self) -> list:
        # tokenise
        words = word_tokenize(self.corpus)

        # remove stopwords (case-sensitive, so capitalised words such as 'The' are kept)
        words = [word for word in words if word not in self.stop_words]

        # lemmatize with spaCy to find the root word of each token
        doc = self.nlp(" ".join(words))

        # final list
        return [token.lemma_ for token in doc]

In [79]:
pre = textPreprocessor(corpus=corpus)
list_of_words = pre.process()

In [80]:
" ".join(list_of_words)

"Pizza one world ' beloved food , know comfort flavor endless variety . from simple street slice gourmet creation , bring people together across culture continent . Pizza originate Naples , flatbread top tomato cheese become popular among working - class family 18th 19th century . the famous Margherita pizza , make tomato , mozzarella , basil , create honor Queen Margherita Savoy . italian immigrant later introduce pizza United States , especially New York City . today , pizza global phenomenon , adapt local taste enjoy countless creative form ."
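
The joined output still contains punctuation tokens such as `,`, `.`, `'` and `-`, and mixes upper and lower case. One possible post-processing step, sketched below on a hand-copied sample of the tokens, keeps only tokens that contain a letter or digit and lowercases them. The `tokens` list is an excerpt typed in for illustration, not the full return value of `process()`.

```python
# A short excerpt of the pipeline output, copied by hand.
tokens = ["Pizza", "one", "world", "'", "beloved", "food", ",", "know",
          "working", "-", "class", "family", "18th", "."]

# Keep a token only if it contains at least one letter or digit, then lowercase it.
cleaned = [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]

print(cleaned)
# ['pizza', 'one', 'world', 'beloved', 'food', 'know', 'working', 'class', 'family', '18th']
```

Whether to lowercase depends on the task: it merges `Pizza` and `pizza`, but it also erases the distinction between proper nouns and common words.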