## **Stemming Exercise**

An online job search platform allows users to search for job listings using free-text queries. The platform stores thousands of job descriptions and user queries in English. However, the search engine often fails to return relevant results because different grammatical forms of the same word are treated as separate terms.

For example, job descriptions may contain words like “developing”, “developer”, and “development”, while users may search for “develop”. This mismatch reduces the accuracy of search results.

To improve search relevance and reduce vocabulary size, the platform plans to apply stemming as part of its text preprocessing pipeline.
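The mismatch described above can be illustrated with a minimal sketch. The `toy_stem` helper and its `SUFFIXES` list are hypothetical and hand-picked for this one example; they are not the Porter or Snowball algorithm, which apply far more careful rules.

```python
# Toy suffix stripper for illustration only; NOT the Porter or Snowball algorithm.
SUFFIXES = ("ment", "ing", "er")  # hypothetical, hand-picked suffix list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


job_terms = ["developing", "developer", "development"]
query = "develop"

# Exact string comparison finds nothing, because no job term equals the query.
exact_matches = [t for t in job_terms if t == query]

# Comparing stems instead maps all three job terms onto the query.
stemmed_matches = [t for t in job_terms if toy_stem(t) == toy_stem(query)]

print(exact_matches)    # []
print(stemmed_matches)  # ['developing', 'developer', 'development']
```

A real stemmer replaces the fixed suffix list with ordered rewrite rules, but the effect on matching is the same idea.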

## **Problem Statement**

You are part of a data science team responsible for enhancing the search functionality of the job portal. Your task is to analyze how stemming can help normalize textual data and compare the behavior of different stemming techniques.

You are given a dataset containing:

1. Job descriptions

2. User search queries

Before applying any machine learning or information retrieval techniques, the text must be preprocessed using stemming algorithms.

In [None]:
import nltk
import string  # Needed by preprocess_text() for string.punctuation

In [None]:
# Download required NLTK Resources
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Job Descriptions and user queries

job_descriptions = [
    "We are looking for a software developer who enjoys developing scalable applications.",
    "The candidate should have experience in data analysis and analyzing large datasets.",
    "Designing, testing, and maintaining systems is a key responsibility."
]

user_queries = [
    "develop software",
    "analyze data",
    "design system"
]


In [None]:
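# Second example set: reassigning these names replaces the lists above,
# and the outputs below are based on this set.
job_descriptions = [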
    "The organization was responsible for organizing multiple international conferences.",
    "Relational databases require normalization and relational mapping techniques.",
    "The system's conditional logic depends on configurable conditions."
]

user_queries = [
    "organize conference",
    "relational database",
    "conditional system"
]

In [None]:
# Text Preprocessing

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    tokens = nltk.word_tokenize(text)  # Split into word tokens
    return tokens

# string.punctuation is a constant string containing the 32 ASCII punctuation characters:
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

# str.maketrans() creates a translation table, a mapping used by translate().
# str.maketrans(x, y, z)
# x --> characters to replace
# y --> characters to replace them with (same length as x)
# z --> characters to delete
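
As a small sketch of the three-argument form of `str.maketrans()`, the example below replaces some characters and deletes others in one pass. The sample sentence is made up for illustration.

```python
import string

# x and y must have equal length: each character in x maps to the
# character at the same position in y. Characters in z are deleted.
table = str.maketrans("-_", "  ", "!?.,")

cleaned = "state-of-the-art, low_level code!".translate(table)

print(cleaned)                  # state of the art low level code
print(len(string.punctuation))  # 32
```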


In [None]:
import string

text = "Hello, world! How are you doing today? I'm fine-thanks. I don't care"
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)


Hello world How are you doing today Im finethanks I dont care
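
The output above shows a side effect of deleting punctuation: `fine-thanks` collapses into `finethanks` and `I'm` into `Im`. One possible alternative, sketched here as an assumption rather than as part of the exercise, maps each punctuation character to a space. Plain `str.split()` stands in for the tokenizer so the snippet needs no NLTK resources.

```python
import string

text = "Hello, world! How are you doing today? I'm fine-thanks. I don't care"

# Deleting punctuation glues "fine-thanks" into a single token.
deleted = text.translate(str.maketrans("", "", string.punctuation)).split()

# Mapping every punctuation character to a space keeps the two words apart.
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
spaced = text.translate(to_space).split()

print(deleted)
print(spaced)
```

The trade-off is that contractions now split into fragments such as `don` and `t`, so which variant is preferable depends on the downstream task.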


In [None]:
# Preprocess Job Description and User Queries

processed_jobs = [preprocess_text(desc) for desc in job_descriptions]
processed_queries = [preprocess_text(query) for query in user_queries]

processed_jobs, processed_queries


([['the',
   'organization',
   'was',
   'responsible',
   'for',
   'organizing',
   'multiple',
   'international',
   'conferences'],
  ['relational',
   'databases',
   'require',
   'normalization',
   'and',
   'relational',
   'mapping',
   'techniques'],
  ['the',
   'systems',
   'conditional',
   'logic',
   'depends',
   'on',
   'configurable',
   'conditions']],
 [['organize', 'conference'],
  ['relational', 'database'],
  ['conditional', 'system']])

In [None]:
# Import and initialize the stemmers
from nltk.stem import PorterStemmer, SnowballStemmer

porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")


In [None]:
# Apply Porter Stemmer
porter_stemmed_jobs = [
    [porter_stemmer.stem(word) for word in desc]
    for desc in processed_jobs
]

porter_stemmed_queries = [
    [porter_stemmer.stem(word) for word in query]
    for query in processed_queries
]

porter_stemmed_jobs, porter_stemmed_queries


([['the',
   'organ',
   'wa',
   'respons',
   'for',
   'organ',
   'multipl',
   'intern',
   'confer'],
  ['relat', 'databas', 'requir', 'normal', 'and', 'relat', 'map', 'techniqu'],
  ['the', 'system', 'condit', 'logic', 'depend', 'on', 'configur', 'condit']],
 [['organ', 'confer'], ['relat', 'databas'], ['condit', 'system']])
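
To see what these stems mean for search, the sketch below counts how many query terms occur in the corresponding job description, before and after stemming. The token lists are copied from the outputs above; the `overlap` helper is a hypothetical, deliberately simple relevance score and not a real ranking function.

```python
# Token lists copied from the notebook outputs above (raw tokens and Porter stems).
raw_jobs = [
    "the organization was responsible for organizing multiple international conferences".split(),
    "relational databases require normalization and relational mapping techniques".split(),
    "the systems conditional logic depends on configurable conditions".split(),
]
raw_queries = [["organize", "conference"], ["relational", "database"], ["conditional", "system"]]

porter_jobs = [
    "the organ wa respons for organ multipl intern confer".split(),
    "relat databas requir normal and relat map techniqu".split(),
    "the system condit logic depend on configur condit".split(),
]
porter_queries = [["organ", "confer"], ["relat", "databas"], ["condit", "system"]]


def overlap(query_tokens, doc_tokens):
    """Count the distinct query tokens that also occur in the document."""
    return len(set(query_tokens) & set(doc_tokens))


# Score each query against the job description with the same index.
raw_scores = [overlap(q, d) for q, d in zip(raw_queries, raw_jobs)]
porter_scores = [overlap(q, d) for q, d in zip(porter_queries, porter_jobs)]

print("Raw token overlap:  ", raw_scores)     # [0, 1, 1]
print("Porter stem overlap:", porter_scores)  # [2, 2, 2]
```

Every query term now matches, but the stems are aggressive: `organ` would also match a listing that mentions a literal organ, so some precision is traded for recall.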

In [None]:
# Apply Snowball Stemmer
snowball_stemmed_jobs = [
    [snowball_stemmer.stem(word) for word in desc]
    for desc in processed_jobs
]

snowball_stemmed_queries = [
    [snowball_stemmer.stem(word) for word in query]
    for query in processed_queries
]

snowball_stemmed_jobs, snowball_stemmed_queries


([['the',
   'organ',
   'was',
   'respons',
   'for',
   'organ',
   'multipl',
   'intern',
   'confer'],
  ['relat', 'databas', 'requir', 'normal', 'and', 'relat', 'map', 'techniqu'],
  ['the', 'system', 'condit', 'logic', 'depend', 'on', 'configur', 'condit']],
 [['organ', 'confer'], ['relat', 'databas'], ['condit', 'system']])
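
The scenario also mentions reducing vocabulary size. A rough check on this small sample, using the raw tokens and the Snowball stems copied from the outputs above, could look like this:

```python
# Raw tokens and Snowball stems of the three job descriptions, copied from the outputs above.
raw_tokens = (
    "the organization was responsible for organizing multiple international conferences "
    "relational databases require normalization and relational mapping techniques "
    "the systems conditional logic depends on configurable conditions"
).split()

snowball_tokens = (
    "the organ was respons for organ multipl intern confer "
    "relat databas requir normal and relat map techniqu "
    "the system condit logic depend on configur condit"
).split()

# The vocabulary is the set of distinct terms.
raw_vocab = set(raw_tokens)
stemmed_vocab = set(snowball_tokens)

print(len(raw_vocab), "->", len(stemmed_vocab))  # 23 -> 21
```

The reduction is small here because the sample has only 25 tokens; the two merges come from `organization`/`organizing` and `conditional`/`conditions`.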

In [None]:
# Compare Stemming Results
for i in range(len(processed_jobs)):
    print(f"\nJob Description Statement {i+1}")
    print("Original Tokens:", processed_jobs[i])
    print("Porter Stemmer:", porter_stemmed_jobs[i])
    print("Snowball Stemmer:", snowball_stemmed_jobs[i])



Job Description Statement 1
Original Tokens: ['the', 'organization', 'was', 'responsible', 'for', 'organizing', 'multiple', 'international', 'conferences']
Porter Stemmer: ['the', 'organ', 'wa', 'respons', 'for', 'organ', 'multipl', 'intern', 'confer']
Snowball Stemmer: ['the', 'organ', 'was', 'respons', 'for', 'organ', 'multipl', 'intern', 'confer']

Job Description Statement 2
Original Tokens: ['relational', 'databases', 'require', 'normalization', 'and', 'relational', 'mapping', 'techniques']
Porter Stemmer: ['relat', 'databas', 'requir', 'normal', 'and', 'relat', 'map', 'techniqu']
Snowball Stemmer: ['relat', 'databas', 'requir', 'normal', 'and', 'relat', 'map', 'techniqu']

Job Description Statement 3
Original Tokens: ['the', 'systems', 'conditional', 'logic', 'depends', 'on', 'configurable', 'conditions']
Porter Stemmer: ['the', 'system', 'condit', 'logic', 'depend', 'on', 'configur', 'condit']
Snowball Stemmer: ['the', 'system', 'condit', 'logic', 'depend', 'on', 'configur', 'condit']
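
Reading the lists by eye makes it easy to miss where the two stemmers disagree. A short sketch that lines up the tokens and keeps only the differing positions, with the data copied from the Porter and Snowball outputs above:

```python
# Original tokens and both sets of stems, copied from the outputs above.
raw_tokens = (
    "the organization was responsible for organizing multiple international conferences "
    "relational databases require normalization and relational mapping techniques "
    "the systems conditional logic depends on configurable conditions"
).split()

porter_tokens = (
    "the organ wa respons for organ multipl intern confer "
    "relat databas requir normal and relat map techniqu "
    "the system condit logic depend on configur condit"
).split()

snowball_tokens = (
    "the organ was respons for organ multipl intern confer "
    "relat databas requir normal and relat map techniqu "
    "the system condit logic depend on configur condit"
).split()

# Keep (original, porter, snowball) only where the two stemmers disagree.
differences = [
    (raw, p, s)
    for raw, p, s in zip(raw_tokens, porter_tokens, snowball_tokens)
    if p != s
]

print(differences)  # [('was', 'wa', 'was')]
```

On this sample the only disagreement is `was`, which Porter truncates to `wa` while Snowball leaves it intact.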

## Lemmatization


In [None]:
# Import Required Libraries.
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


In [None]:
# Download required NLTK Resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')  # Open Multilingual WordNet 1.4: links WordNet concepts across multiple languages
nltk.download('averaged_perceptron_tagger')  # Averaged perceptron part-of-speech (POS) tagger
nltk.download('averaged_perceptron_tagger_eng')  # English-specific version of the averaged perceptron POS tagger



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [None]:
# Sample Support Ticket Data
support_tickets = [
    "Customers were running into issues while installing the application.",
    "The system crashes when users are logging in repeatedly.",
    "Files were deleted accidentally and could not be recovered.",
    "The devices are connected but not responding properly."
]

In [None]:
# Second example set: replaces the list above, and the outputs below are based on it.
support_tickets = [
    "Users were running multiple processes in the background.",
    "Files were deleted accidentally by the system.",
    "The application crashes when customers are logging in.",
    "Devices were connected but stopped responding suddenly.",
    "The system configurations were changed without permission.",
    "Customers complained that the services were not working properly."
]

In [None]:
# Text Preprocessing
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return tokens


In [None]:
# Preprocess Support Tickets
processed_tickets = [preprocess_text(ticket) for ticket in support_tickets]
processed_tickets


[['users',
  'were',
  'running',
  'multiple',
  'processes',
  'in',
  'the',
  'background'],
 ['files', 'were', 'deleted', 'accidentally', 'by', 'the', 'system'],
 ['the',
  'application',
  'crashes',
  'when',
  'customers',
  'are',
  'logging',
  'in'],
 ['devices', 'were', 'connected', 'but', 'stopped', 'responding', 'suddenly'],
 ['the',
  'system',
  'configurations',
  'were',
  'changed',
  'without',
  'permission'],
 ['customers',
  'complained',
  'that',
  'the',
  'services',
  'were',
  'not',
  'working',
  'properly']]

In [None]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


In [None]:
# Lemmatization without POS Tagging
lemmatized_no_pos = [
    [lemmatizer.lemmatize(word) for word in ticket]
    for ticket in processed_tickets
]

lemmatized_no_pos


[['user', 'were', 'running', 'multiple', 'process', 'in', 'the', 'background'],
 ['file', 'were', 'deleted', 'accidentally', 'by', 'the', 'system'],
 ['the', 'application', 'crash', 'when', 'customer', 'are', 'logging', 'in'],
 ['device', 'were', 'connected', 'but', 'stopped', 'responding', 'suddenly'],
 ['the',
  'system',
  'configuration',
  'were',
  'changed',
  'without',
  'permission'],
 ['customer',
  'complained',
  'that',
  'the',
  'service',
  'were',
  'not',
  'working',
  'properly']]

In [None]:
# POS Tagging Function
# Convert NLTK POS tags to WordNet POS tags.
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
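
The same mapping can be written as a lookup table. The sketch below is a dependency-free variant: `penn_to_wordnet` and `TAG_PREFIX_TO_WORDNET` are hypothetical names, and the single-letter codes are the string values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV`.

```python
# First letter of a Penn Treebank tag -> WordNet POS code.
# 'a', 'v', 'n', 'r' are the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV respectively.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["VBG", "NNS", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as `DT` for determiners, fall back to noun, which matches the `else` branch of the function above.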


In [None]:
# Lemmatization with POS Tagging
lemmatized_with_pos = []

for ticket in processed_tickets:
    pos_tags = nltk.pos_tag(ticket)
    lemmas = [
        lemmatizer.lemmatize(word, get_wordnet_pos(pos))
        for word, pos in pos_tags
    ]
    lemmatized_with_pos.append(lemmas)

lemmatized_with_pos


[['user', 'be', 'run', 'multiple', 'process', 'in', 'the', 'background'],
 ['file', 'be', 'delete', 'accidentally', 'by', 'the', 'system'],
 ['the', 'application', 'crash', 'when', 'customer', 'be', 'log', 'in'],
 ['device', 'be', 'connect', 'but', 'stop', 'respond', 'suddenly'],
 ['the', 'system', 'configuration', 'be', 'change', 'without', 'permission'],
 ['customer',
  'complain',
  'that',
  'the',
  'service',
  'be',
  'not',
  'work',
  'properly']]
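
To make the effect of POS tagging explicit, the sketch below lists only the tokens whose lemma differs between the two runs. The lemma lists are copied from the two lemmatization outputs above.

```python
# Lemmas without and with POS tags, one string per ticket, copied from the outputs above.
no_pos = [
    "user were running multiple process in the background",
    "file were deleted accidentally by the system",
    "the application crash when customer are logging in",
    "device were connected but stopped responding suddenly",
    "the system configuration were changed without permission",
    "customer complained that the service were not working properly",
]
with_pos = [
    "user be run multiple process in the background",
    "file be delete accidentally by the system",
    "the application crash when customer be log in",
    "device be connect but stop respond suddenly",
    "the system configuration be change without permission",
    "customer complain that the service be not work properly",
]

# For each ticket, keep (no-POS lemma, with-POS lemma) pairs that differ.
changed = [
    [(a, b) for a, b in zip(plain.split(), tagged.split()) if a != b]
    for plain, tagged in zip(no_pos, with_pos)
]

for i, pairs in enumerate(changed, start=1):
    print(f"Ticket {i}: {pairs}")
```

Every changed token is a verb form. Without a POS tag the lemmatizer treats each word as a noun, so forms such as `were` and `running` pass through unchanged.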

In [None]:
# Side by Side Comparison
for i in range(len(processed_tickets)):
    print(f"\nSupport Ticket {i+1}")
    print("Original Tokens:      ", processed_tickets[i])
    print("Lemmatized (No POS):  ", lemmatized_no_pos[i])
    print("Lemmatized (With POS):", lemmatized_with_pos[i])



Support Ticket 1
Original Tokens:       ['users', 'were', 'running', 'multiple', 'processes', 'in', 'the', 'background']
Lemmatized (No POS):   ['user', 'were', 'running', 'multiple', 'process', 'in', 'the', 'background']
Lemmatized (With POS): ['user', 'be', 'run', 'multiple', 'process', 'in', 'the', 'background']

Support Ticket 2
Original Tokens:       ['files', 'were', 'deleted', 'accidentally', 'by', 'the', 'system']
Lemmatized (No POS):   ['file', 'were', 'deleted', 'accidentally', 'by', 'the', 'system']
Lemmatized (With POS): ['file', 'be', 'delete', 'accidentally', 'by', 'the', 'system']

Support Ticket 3
Original Tokens:       ['the', 'application', 'crashes', 'when', 'customers', 'are', 'logging', 'in']
Lemmatized (No POS):   ['the', 'application', 'crash', 'when', 'customer', 'are', 'logging', 'in']
Lemmatized (With POS): ['the', 'application', 'crash', 'when', 'customer', 'be', 'log', 'in']

Support Ticket 4
Original Tokens:       ['devices', 'were', 'connected', 'but', 's