1. Stemming

Definition:
Stemming is a rule-based process that strips affixes (in practice, mostly suffixes) from a word to reduce it to a root form. The result does not have to be a valid dictionary word.

Example:

| Original Word | Stemmed Form (Porter) |
| ------------- | --------------------- |
| Playing       | play                  |
| Played        | play                  |
| Studies       | studi                 |
| Better        | better (no rule applies, so only the case changes) |


How it works:

Uses simple rules (like removing “ing”, “ed”, “s”, etc.).

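Fast and lightweight, but sometimes inaccurate: unrelated words can collapse to the same stem (over-stemming), and related forms such as “ran” and “run” can stay apart (under-stemming).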
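The suffix rules described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Porter algorithm: the rule list, the `naive_stem` name, and the `min_stem_len` guard are all assumptions made for this example.

```python
# Hypothetical rule table: (suffix, replacement), tried in order.
# "ies" comes before "s" so that "studies" becomes "studi", not "studie".
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word  # no rule applied


for w in ["Playing", "Played", "Studies", "running", "sing"]:
    print(w, "->", naive_stem(w))
```

Note that this sketch turns "running" into "runn": it has no rule for doubled consonants, which is one of the cases a real stemmer such as Porter handles with extra conditions.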

Popular Stemming Algorithms:

Porter Stemmer (most common)

Snowball Stemmer (improved Porter version)

Lancaster Stemmer (more aggressive)

running → run

ran → ran

runs → run

easily → easili

fairly → fairli
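To see how the three stemmers listed above differ, they can be run side by side. The sketch below assumes NLTK is installed; the stemmer classes need no extra downloads. The `compare_stems` helper and the word list are made up for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

words = ["running", "easily", "fairly", "studies"]


def compare_stems(words, stemmers):
    """Return {word: {stemmer_name: stem}} for side-by-side inspection."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}


for word, stems in compare_stems(words, stemmers).items():
    print(word, stems)
```

For "fairly", Porter returns "fairli" while Snowball returns "fair", which is one visible effect of Snowball's revised rules.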


In [34]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download punkt if not already installed
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
sentence = "The runners were running in a race and they ran very fast"

In [36]:
stemmer=PorterStemmer()

In [37]:
stemmer.stem("History")

'histori'

In [38]:
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; newer NLTK releases
# look for it under the name "punkt_tab", so both are requested here
nltk.download("punkt")
nltk.download("punkt_tab")


# Example sentence used in the cells below
sentence = "The runners were running quickly and fairly."

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [39]:
tokens = word_tokenize(sentence)
tokens

['The', 'runners', 'were', 'running', 'quickly', 'and', 'fairly', '.']

In [40]:
stemmed_words = [stemmer.stem(word) for word in tokens]

In [41]:
print(tokens)
print(stemmed_words)

['The', 'runners', 'were', 'running', 'quickly', 'and', 'fairly', '.']
['the', 'runner', 'were', 'run', 'quickli', 'and', 'fairli', '.']


2. Lemmatization

Definition:
Lemmatization uses vocabulary and morphological analysis to return the base or dictionary form of a word (called a lemma).
Unlike stemming, it ensures the result is a real word.

Example:

| Original Word | Lemma |
| ------------- | ----- |
| Playing       | Play  |
| Studies       | Study |
| Better        | Good  |
| Running       | Run   |


How it works:

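Considers the part of speech (POS) of the word. NLTK's WordNetLemmatizer does not infer the POS from context: the caller passes it through the pos argument, which defaults to noun.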

Uses a dictionary (like WordNet) to find the correct base form.

More accurate but slower than stemming.

playing → play (with pos='v')

studies → study

better → good (with pos='a', i.e. treated as an adjective)

running → run (with pos='v')
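Because the lemma depends on the part of speech, a common pattern is to tag each token first and translate the tag into the single-letter POS codes that WordNet uses ('n', 'v', 'a', 'r'). The helper below is a minimal sketch of that translation for Penn Treebank style tags; the function name is hypothetical, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, also used as the fallback for all other tags


for tag in ["NNS", "VBD", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With NLTK, the (word, tag) pairs would typically come from `nltk.pos_tag(tokens)`, which needs its own tagger resource to be downloaded, and the mapped letter would be passed as the `pos` argument of `lemmatize`.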


In [42]:
import nltk
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

nltk.download("punkt")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [43]:
lemmatizer = WordNetLemmatizer()

In [44]:
# pos='v' treats every token as a verb, so nouns such as 'runners' pass through unchanged
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in tokens]

In [45]:
print(tokens)
print(lemmatized_words)

['The', 'runners', 'were', 'running', 'quickly', 'and', 'fairly', '.']
['The', 'runners', 'be', 'run', 'quickly', 'and', 'fairly', '.']


In [46]:
text = "<html><body><p> Movie 1</p><p> Actor - Aamir Khan</p><p> Click here to <a href='http://google.com'>download</a></p></body></html>"

In [47]:
import re

def removal_html_tags(text):
    # Non-greedy match, so each <...> tag is removed on its own
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [48]:
removal_html_tags(text)

' Movie 1 Actor - Aamir Khan Click here to download'
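The regex above is fine for simple markup, but it leaves entities such as `&amp;` undecoded and keeps stray whitespace. As an alternative sketch, the standard library's `html.parser` can collect only the text nodes. The class and function names here are made up for this example, and joining the pieces without a separator is an assumption chosen to mirror the regex version.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities like &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    # Join the pieces as-is, then collapse runs of whitespace
    return " ".join("".join(parser.parts).split())


print(strip_html("<p> Movie 1</p><p> Actor - Aamir Khan</p>"))
print(strip_html("<b>a &amp; b</b>"))
```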

In [49]:
chat_words = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "FYI": "For Your Information",
    "BRB": "Be Right Back",
    "BTW": "By The Way",
    "OMG": "Oh My God",
    "IMO": "In My Opinion",
    "LOL": "Laugh Out Loud",
    "TTYL": "Talk To You Later",
    "GTG": "Got To Go",
    "TTYT": "Talk To You Tomorrow",
    "IDK": "I Don't Know",
    "TMI": "Too Much Information",
    "IMHO": "In My Humble Opinion",
    "ICYMI": "In Case You Missed It",
    "FAQ": "Frequently Asked Questions",
    "TGIF": "Thank God It's Friday",
    "FYA": "For Your Action",
}

In [50]:
chat_words["ASAP"]

'As Soon As Possible'

In [51]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [52]:
chat_conversion("go ASAP")

'go As Soon As Possible'
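Splitting on whitespace means an abbreviation followed by punctuation, as in "go ASAP!", is not found in the table. One possible variant, sketched below, matches whole words with a regular expression instead. The `CHAT_WORDS` subset is repeated only so the block runs on its own, and `expand_chat_words` is a hypothetical name.

```python
import re

# Small subset of the abbreviation table, repeated so the sketch is self-contained
CHAT_WORDS = {
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "FYI": "For Your Information",
}

# \b anchors each alternative to word boundaries, so adjacent punctuation is left alone
CHAT_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CHAT_WORDS)) + r")\b",
    re.IGNORECASE,
)


def expand_chat_words(text):
    """Replace known abbreviations, matching case-insensitively."""
    return CHAT_PATTERN.sub(lambda m: CHAT_WORDS[m.group(0).upper()], text)


print(expand_chat_words("go ASAP!"))
print(expand_chat_words("brb, fyi"))
```

Like the original function, this matches case-insensitively, so an ordinary lowercase word that happens to equal an abbreviation would also be expanded.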