# Stemming
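Stemming is a Natural Language Processing (NLP) technique that reduces words to their root or base form, known as the stem.
It is a rule-based process that crudely chops off the ends (suffixes) of words, so the resulting stem is not always a dictionary word.
**Example**: the Porter stemmer reduces both studies and studying to studi; a related word such as studious may or may not end up with the same stem, depending on the algorithm.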
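
To make the idea of rule-based suffix chopping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the suffix list, the `toy_stem` name, and the minimum stem length are all assumptions made up for this illustration.

```python
# Toy suffix stripper: an illustrative sketch, NOT the Porter algorithm.
# The suffix list and minimum stem length are assumptions chosen for this example.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["studies", "studying", "troubled", "runs"]:
    print(w, "->", toy_stem(w))
```

With these few rules, studies and studying end up with different stems (stud and study). Real stemmers such as Porter apply additional rewrite steps so that related forms are more likely to share one stem.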

# Lemmatization
Lemmatization is the Natural Language Processing (NLP) process of reducing inflected words to their base or dictionary form, known as the lemma.

Unlike stemming, which often involves simply truncating word endings and can sometimes result in non-dictionary words, lemmatization considers the word's context and grammatical structure to determine its correct base form.
**For example**, "running," "ran," and "runs" are all lemmatized to "run," while "better" is lemmatized to "good" when it is treated as an adjective. Choosing the correct lemma therefore often requires knowing the word's part of speech (e.g., noun, verb, adjective).
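
The dependence on part of speech can be sketched with a tiny lookup table. The `LEXICON` dictionary and the `toy_lemmatize` function are hypothetical and exist only for this illustration; a real lemmatizer such as NLTK's WordNetLemmatizer consults the WordNet database instead.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) -> lemma.
# A real lemmatizer looks words up in a full lexical database such as WordNet.
LEXICON = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",    # adjective: comparative of "good"
    ("better", "v"): "better",  # verb: "to better oneself"
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("ran", "v"))
print(toy_lemmatize("better", "a"))
print(toy_lemmatize("better", "v"))
```

The same surface form, "better", maps to two different lemmas depending on the tag it is given, which is the point a pure suffix-stripping approach cannot capture.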

In [6]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\91880\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [7]:
stemmer = PorterStemmer()

In [8]:
sentence = "The runners were running in a race and they ran very fast"

In [10]:
words = word_tokenize(sentence)
words

['The',
 'runners',
 'were',
 'running',
 'in',
 'a',
 'race',
 'and',
 'they',
 'ran',
 'very',
 'fast']

In [11]:
stemmer.stem('troubled')

'troubl'

In [12]:
# stemming
stemmed_words = [stemmer.stem(word) for word in words]

In [13]:
print(words)
print(stemmed_words)

['The', 'runners', 'were', 'running', 'in', 'a', 'race', 'and', 'they', 'ran', 'very', 'fast']
['the', 'runner', 'were', 'run', 'in', 'a', 'race', 'and', 'they', 'ran', 'veri', 'fast']


In [14]:
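# lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet  # WordNet corpus reader; exposes POS constants such as wordnet.VERB

# One-time downloads of the data the tokenizer and lemmatizer rely on
nltk.download('punkt')    # tokenizer models
nltk.download('wordnet')  # WordNet database used by WordNetLemmatizer
nltk.download('omw-1.4')  # Open Multilingual WordNet data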

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91880\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\91880\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\91880\AppData\Roaming\nltk_data...


True

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
sentence = "The runners were running in a race and they ran very fast"

In [None]:
words = word_tokenize(sentence)
words

['The',
 'runners',
 'were',
 'running',
 'in',
 'a',
 'race',
 'and',
 'they',
 'ran',
 'very',
 'fast']

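In NLP, part-of-speech (POS) tagging is the process of assigning a grammatical tag such as noun, verb, or adjective to each word in a text. The lemmatizer call in the next cell passes pos='v' for every token, so each word is treated as a verb: "were" becomes "be" and "ran" becomes "run," while the noun "runners" is left unchanged.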


In [None]:
# pos='v' tells the lemmatizer to treat every token as a verb
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

In [None]:
print(words)
print(lemmatized_words)

['The', 'runners', 'were', 'running', 'in', 'a', 'race', 'and', 'they', 'ran', 'very', 'fast']
['The', 'runners', 'be', 'run', 'in', 'a', 'race', 'and', 'they', 'run', 'very', 'fast']
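
Because every token was lemmatized as a verb, the noun "runners" was not reduced. One common remedy is to give each word its own POS tag. The sketch below shows only the tag-mapping step; the `penn_to_wordnet` helper is a hypothetical name, and the tags in `tagged` are hand-written assumptions rather than real tagger output (in practice they would come from a tagger such as `nltk.pos_tag`, which needs its own data download).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # fall back to noun, which is also WordNetLemmatizer's default

# Hand-written (word, tag) pairs for illustration only; not produced by a tagger here.
tagged = [("runners", "NNS"), ("were", "VBD"), ("running", "VBG"), ("fast", "RB")]
pos_pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_pairs)
```

Each resulting pair could then be passed to the lemmatizer as `lemmatizer.lemmatize(word, pos=pos)` instead of using 'v' for every word.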


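# Handling Special Characters
Raw text often contains symbols such as #, @, and $, as well as a mix of upper- and lower-case letters. A common cleanup step converts the case and removes unwanted characters with regular expressions (regex).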

In [None]:
import re

In [None]:
text = 'The price of the iPhone $999.99! Call now at 123-456-7890.'

In [None]:
text = text.lower()

In [None]:
# Remove every character that is not a lowercase letter or whitespace, then trim the ends
text_cleaned = re.sub(r'[^a-z\s]','', text).strip()

In [None]:
text_cleaned

'the price of the iphone  call now at'

In [None]:
# Collapse each run of whitespace into a single space
text_cleaned = re.sub(r'\s+',' ', text_cleaned)

In [None]:
text_cleaned

'the price of the iphone call now at'
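
The cleanup steps above can be gathered into one reusable function. This is a minimal sketch: the `clean_text` name and the `keep_digits` option are assumptions added for illustration and are not part of the original cells.

```python
import re

def clean_text(text, keep_digits=False):
    """Lowercase text, drop unwanted characters, and normalize whitespace."""
    text = text.lower()
    # Pattern of characters to remove; optionally keep digits as well as letters.
    unwanted = r"[^a-z0-9\s]" if keep_digits else r"[^a-z\s]"
    text = re.sub(unwanted, "", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

sample = 'The price of the iPhone $999.99! Call now at 123-456-7890.'
print(clean_text(sample))  # the price of the iphone call now at
print(clean_text(sample, keep_digits=True))
```

Note that stripping digits and punctuation also discards the price and the phone number, so whether to keep digits depends on what the downstream task needs.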