# Normalization and Stemming

Based on **Stats Wire** video: https://www.youtube.com/watch?v=I8HNBp8ReLg&list=PLBSCvBlTOLa_wS8iy84DfyizdSs7ps7L5&index=7

## Normalization

Normalization is the process of transforming text data into a standard, consistent format. It typically involves steps such as:
+ **Converting all text to lowercase**: This ensures that the same word in different cases is treated as identical (e.g., "apple" and "Apple").
+ **Removing punctuation**: Punctuation marks are removed from the text, as they usually do not contribute to the meaning of the individual words.
+ **Removing stop words**: Stop words are common words like "a", "an", "the", etc., that occur frequently in a language but typically do not carry significant meaning. Removing stop words helps reduce noise and focus on important content words.
+ **Removing numerical digits or special characters**: Numbers or special characters that do not carry specific meaning in the context of text analysis can be removed. 
+ **Removing HTML tags or URLs**: If the text contains HTML tags or URLs, they can be stripped out.
+ **Handling contractions**: Converting contractions like "don't" to "do not" ensures consistency in word usage.

The goal of normalization is to create a cleaner and more standardized representation of the text, making further analysis and processing easier.
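
The steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the `normalize` function, the tiny `STOP_WORDS` set, and the `CONTRACTIONS` map are all hypothetical names invented for this example, and real projects usually rely on much larger stop-word and contraction lists.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "at"}

# Small hypothetical contraction map; extend as needed.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Return a list of normalized tokens for `text`."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # strip URLs
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"\d+", " ", text)                     # drop digits
    text = text.translate(PUNCT_TABLE)                   # drop punctuation
    # Split on whitespace and filter out stop words.
    return [w for w in text.split() if w not in STOP_WORDS]


sample = "<p>Don't miss the 3 Apples at https://example.com!</p>"
print(normalize(sample))  # ['do', 'not', 'miss', 'apples']
```

The order of the steps matters in this sketch: contractions are expanded before punctuation is removed, otherwise the apostrophe in "don't" would be stripped first and the lookup would no longer match.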

## Stemming

Stemming is a technique for reducing words to their base or root form, known as the **stem**. The stem may not be an actual word itself, but it represents the core meaning of the word. Stemming reduces inflected or derived words to a common form so that variations of the same word are treated as identical. Stemming algorithms apply linguistic rules and heuristics to strip affixes (mostly suffixes) from words, aiming to identify the common stem. For example, stemming reduces "running" and "runs" to the common stem "run". This simplification can help with tasks such as information retrieval, text classification, and clustering.

Stemming is a rule-based process and does not always produce accurate results. It can generate stems that are not actual words, and it usually misses irregular forms: a suffix-stripping stemmer leaves "ran" unchanged rather than mapping it to "run". In such cases, lemmatization, which considers the word's part of speech and context to determine its base form (**lemma**), can be more linguistically accurate.
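
To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm used in the cells that follow: `toy_stem`, `SUFFIXES`, and `min_stem_len` are hypothetical names made up for this illustration, and the rule set is far smaller than what a real stemmer applies.

```python
# Suffixes are checked in order, so longer ones must come first.
SUFFIXES = ("ings", "ing", "ed", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


for w in ["works", "working", "workings", "worked", "running", "ran", "studies"]:
    print(w, "->", toy_stem(w))
# works -> work
# working -> work
# workings -> work
# worked -> work
# running -> runn
# ran -> ran
# studies -> studi
```

The last three lines of output show the limitations described above: "runn" and "studi" are not real words, and the irregular form "ran" is never connected to "run". Real stemmers add further rules (for example, for doubled consonants), but they remain heuristics.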

In [1]:
import nltk

In [2]:
text1 = "Work works working workings worked"

In [3]:
print(text1)

Work works working workings worked


In [4]:
text1.lower()

'work works working workings worked'

In [5]:
text1.lower().split(' ')

['work', 'works', 'working', 'workings', 'worked']

In [6]:
words1 = text1.lower().split(' ')

In [7]:
print(words1)

['work', 'works', 'working', 'workings', 'worked']


In [9]:
# create a PorterStemmer instance
porter = nltk.PorterStemmer()

In [11]:
# stem each word in the list
[porter.stem(w) for w in words1]

['work', 'work', 'work', 'work', 'work']