# NLP Assignments  
**Author:** Raymundo Java Jr.

This notebook covers three essential NLP preprocessing assignments:

### Assignment 1: Removing Subsequent Occurrence of Words
Eliminate adjacent duplicate words to improve clarity and reduce noise.

### Assignment 2: Adding Custom Stop Words to `nltk` and `spacy`
Extend the default stop word lists with custom words (e.g., "customword1", "customword2", "customword3") for domain-specific text cleaning.

### Assignment 3: Stemming
Compare stemming techniques (PorterStemmer, LancasterStemmer, SnowballStemmer) using maintenance-related text to determine the best approach for technical documentation.

---

### Assignment --> Use Case 1: Removing Subsequent Occurrences of Words

Removing subsequent occurrences of a word (collapsing adjacent duplicate words into one) is a common preprocessing step in NLP. This task is important because:
1. Repeated words can distort text analysis, especially in tasks like text summarization, sentiment analysis, and language modeling.
2. Removing redundant words improves the readability of the text, making it more coherent.
3. Reducing noise in the text data can improve the performance of machine learning models.

Input:
A single string text that may contain multiple sentences and words. Words are separated by spaces.

Output:
A single string with subsequent duplicate words removed.

Constraints:
The input string can be empty.
The words are case-sensitive, meaning "Word" and "word" are considered different.

In [45]:
def remove_adjacent_duplicate(text):
    """Remove adjacent duplicate words from text."""
    words = text.split()
    result = []
    for i, word in enumerate(words):
        if i == 0 or word != words[i - 1]:
            result.append(word)
    return " ".join(result)
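Because the function compares whitespace-separated tokens exactly, pairs such as "Inspection inspection" or "Lubrication, lubrication," are not treated as duplicates. The sketch below is one possible variant, not part of the assignment: it uses `itertools.groupby` and an optional `key` function to decide when two neighbours count as the same word. The names `remove_adjacent_duplicate_by` and `loose_key` are hypothetical.

```python
import string
from itertools import groupby


def remove_adjacent_duplicate_by(text, key=None):
    """Collapse runs of adjacent words that compare equal under `key`.

    With key=None the comparison is exact (case-sensitive, punctuation
    included), which matches a plain split-and-compare loop.
    """
    words = text.split()
    # groupby yields one group per run of consecutive equal keys;
    # keeping the first item of each group drops the repeats.
    return " ".join(next(group) for _, group in groupby(words, key=key))


def loose_key(word):
    """Hypothetical normalizer: ignore case and surrounding punctuation."""
    return word.strip(string.punctuation).lower()


print(remove_adjacent_duplicate_by("Pump Pump performance is stable"))
# Pump performance is stable
print(remove_adjacent_duplicate_by("Lubrication, lubrication, is key", key=loose_key))
# Lubrication, is key
```

With `loose_key`, the first token of each run is kept as written, so its original case and punctuation survive. Note that this looser matching contradicts the case-sensitivity constraint above, so it is only appropriate when that constraint is relaxed.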

In [46]:
test_cases = [
    ("The compressor compressor is overheating", "Basic duplicate removal"),
    (
        "Routine maintenance maintenance is essential for safe operations",
        "Maintenance redundancy",
    ),
    (
        "Engine repair repair improves system performance",
        "Engine repair duplication",
    ),
    ("Pump Pump performance is stable", "Case-sensitive check"),
    (
        "Inspection inspection procedures ensure safety",
        "Mixed uppercase-lowercase words",
    ),
    (
        "During maintenance, maintenance, safety checks are vital. Safety safety protocols are followed.",
        "Multiple sentence handling",
    ),
    (
        "Replacing filters filters improves overall efficiency. Efficiency efficiency boosts performance.",
        "Duplicate phrases in sentences",
    ),
    (
        "Valve! Valve! is critical.",
        "Punctuation should not affect word removal",
    ),
    (
        "Lubrication, lubrication, is key for operation.",
        "Comma-separated duplicate words",
    ),
    ("", "Empty string case"),
    ("Overhaul", "Single word case"),
    ("System efficiency efficiency is 98% 98%", "Numbers in text"),
    ("Maintenance Maintenance Maintenance is crucial", "Highly repetitive phrase"),
    (
        "Preventive maintenance preventive maintenance reduces downtime",
        "Maintenance context redundancy",
    ),
    (
        "Engine models models power modern maintenance applications",
        "Engine model duplication",
    ),
    (
        "In equipment inspection inspection, identifying faults faults is essential",
        "Equipment inspection redundancy",
    ),
    (
        "Operational operational checks and checks improve reliability reliability.",
        "Long text performance test",
    ),
]

for text, description in test_cases:
    print(f"Test: {description}")
    print(f"Input: {text}")
    print(f"Output: {remove_adjacent_duplicate(text)}\n")


Test: Basic duplicate removal
Input: The compressor compressor is overheating
Output: The compressor is overheating

Test: Maintenance redundancy
Input: Routine maintenance maintenance is essential for safe operations
Output: Routine maintenance is essential for safe operations

Test: Engine repair duplication
Input: Engine repair repair improves system performance
Output: Engine repair improves system performance

Test: Case-sensitive check
Input: Pump Pump performance is stable
Output: Pump performance is stable

Test: Mixed uppercase-lowercase words
Input: Inspection inspection procedures ensure safety
Output: Inspection inspection procedures ensure safety

Test: Multiple sentence handling
Input: During maintenance, maintenance, safety checks are vital. Safety safety protocols are followed.
Output: During maintenance, safety checks are vital. Safety safety protocols are followed.

Test: Duplicate phrases in sentences
Input: Replacing filters filters improves overall efficiency. Effi

---

### Assignment --> Use Case 2: Adding Custom Stop Words to `nltk` and `spacy`


Adding custom stop words is a crucial preprocessing step in NLP. This task is important because:

1. Customizing stop words allows for more flexible and relevant text cleaning tailored to specific use cases.
2. Adding domain-specific stop words improves the performance of text analysis and machine learning models by removing irrelevant terms.
3. Removing non-essential words improves the readability and coherence of the text.

Objective:
Extend the default stop words list in both `nltk` and `spacy` by adding custom stop words.

Input:
A list of custom stop words to be added to the existing stop words list in `nltk` and `spacy`.

Output:
A function that takes a string and returns the text with both default and custom stop words removed.

Constraints:
The input string can be empty.
The words are case-sensitive, meaning "Word" and "word" are considered different.

Instructions:
1. Add custom stop words to `nltk`'s default stop words list.
2. Add custom stop words to `spacy`'s default stop words list.
3. Remove stop words from a given text using the updated stop words list for both `nltk` and `spacy`.

**Note: Please ensure that the custom stop words you add are unique to your implementation. When testing and checking your notebooks, I will include these specific words to ensure they have been correctly added to your stop words list.**
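The mechanism behind these instructions can be sketched without either library: merge a default set with a custom set, then drop every token found in the merged set. This is only an illustration under stated assumptions. `DEFAULT_STOP_WORDS` is a tiny hypothetical stand-in for a library's real list, the regex tokenizer is a simplification, and `strip_stop_words` is an illustrative name.

```python
import re

# Hypothetical stand-in for a library's default stop word list
DEFAULT_STOP_WORDS = {"the", "is", "a", "and", "of", "to"}

# Placeholder custom words named in this assignment
CUSTOM_STOP_WORDS = {"customword1", "customword2", "customword3"}

# Set union builds the extended list and leaves the default set untouched
STOP_WORDS = DEFAULT_STOP_WORDS | CUSTOM_STOP_WORDS


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Return text with every token found in stop_words removed."""
    # Simplified tokenizer: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Assumption: tokens are lowercased before lookup (case-insensitive match)
    return " ".join(t for t in tokens if t.lower() not in stop_words)


print(strip_stop_words("The pump customword1 is leaking and customword2 failed"))
# pump leaking failed
```

Lowercasing before the lookup makes the match case-insensitive. For strictly case-sensitive matching, as the constraints describe, the `.lower()` call would be dropped.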

Custom Stop Words to Use:    
"customword1";  
"customword2";  
"customword3" 

In [47]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [48]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [49]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jeng\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jeng\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [50]:
nltk_stop_words = set(stopwords.words("english"))
print("NLTK Default Stopwords:", list(nltk_stop_words)[:10])  # Show 10 stop words (set order is arbitrary)

nlp = spacy.load("en_core_web_sm")
spacy_stop_words = nlp.Defaults.stop_words  # Reference to spaCy's shared set, not a copy
print("spaCy Default Stopwords:", list(spacy_stop_words)[:10])  # Show 10 stop words (set order is arbitrary)

NLTK Default Stopwords: ['because', 'further', 'hadn', 'once', "it's", 'shouldn', "haven't", 'above', 'again', 'had']
spaCy Default Stopwords: ['four', 'assembly', 'because', 'behind', 'once', 'even', 'above', 'own', 'hence', 'beside']


In [51]:
custom_stop_words = {
    "check", "checked", "perform", "performed", "process", "processed",
    "done", "completed", "required", "ensure", "ensured", "confirm", "confirmed",
    "unit", "component", "equipment", "system", "assembly", "section", 
    "scheduled", "unscheduled", "task", "tasks", "work", "procedure", "procedures",
    "operating", "operation", "operated", "service", "serviced", "maintain", "maintenance",
    "repair", "repaired", "issue", "issues", "inspection", "inspected"
}

# Add custom stopwords to NLTK
nltk_stop_words.update(custom_stop_words)

# Add custom stopwords to spaCy
for word in custom_stop_words:
    nlp.Defaults.stop_words.add(word)
    nlp.vocab[word].is_stop = True  

# Print to verify
print("Updated NLTK Stopwords:", "check" in nltk_stop_words)  
print("Updated spaCy Stopwords:", "check" in nlp.Defaults.stop_words)  

Updated NLTK Stopwords: True
Updated spaCy Stopwords: True


In [52]:
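def remove_stopwords(text, method="nltk"):
    """Remove default and custom stop words from text using NLTK or spaCy.

    Tokens are lowercased before lookup, so matching is case-insensitive.
    """
    if method == "nltk":
        tokens = word_tokenize(text)
        stop_words = nltk_stop_words
    elif method == "spacy":
        tokens = [token.text for token in nlp(text)]
        stop_words = nlp.Defaults.stop_words
    else:
        raise ValueError(f"Unknown method: {method!r} (expected 'nltk' or 'spacy')")

    # Punctuation tokens are kept because they are not in either stop word list
    return " ".join(token for token in tokens if token.lower() not in stop_words)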


In [53]:
# Sample maintenance log with more details
maintenance_text = (
    "The technician checked the system and ensured that the bearing and pump were serviced as required. "
    "A routine inspection was performed on the compressor and heat exchanger to prevent future failures. "
    "Scheduled maintenance included lubrication of all rotating equipment, tightening of loose fasteners, "
    "and replacement of worn-out gaskets. The shutdown procedure was followed to avoid unexpected downtime, "
    "and all safety measures were confirmed before resuming operations."
)

# Apply stopword removal
nltk_filtered_text = remove_stopwords(maintenance_text, method="nltk")
spacy_filtered_text = remove_stopwords(maintenance_text, method="spacy")

print("\nFiltered Text (NLTK):", nltk_filtered_text)
print("\nFiltered Text (spaCy):", spacy_filtered_text)



Filtered Text (NLTK): technician bearing pump . routine compressor heat exchanger prevent future failures . included lubrication rotating , tightening loose fasteners , replacement worn-out gaskets . shutdown followed avoid unexpected downtime , safety measures resuming operations .

Filtered Text (spaCy): technician bearing pump . routine compressor heat exchanger prevent future failures . included lubrication rotating , tightening loose fasteners , replacement worn - gaskets . shutdown followed avoid unexpected downtime , safety measures resuming operations .


In [54]:
test_words = ["inspection", "checked", "perform", "unit", "process"]

for word in test_words:
    print(f"'{word}' in NLTK stopwords:", word in nltk_stop_words)
    print(f"'{word}' in spaCy stopwords:", word in nlp.Defaults.stop_words)


'inspection' in NLTK stopwords: True
'inspection' in spaCy stopwords: True
'checked' in NLTK stopwords: True
'checked' in spaCy stopwords: True
'perform' in NLTK stopwords: True
'perform' in spaCy stopwords: True
'unit' in NLTK stopwords: True
'unit' in spaCy stopwords: True
'process' in NLTK stopwords: True
'process' in spaCy stopwords: True


---

### Assignment --> Use Case 3: `nltk` Stemming

Objective:

Understand and compare the stemming techniques. Determine when each stemming technique is appropriate to use based on the context and requirements.

Instructions:

Apply stemming using `PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer`.

Compare the results and analyze the differences.

Write code to demonstrate the stemming process for each stemmer.
Provide example text and show the output of each stemming process.

Analysis:

Discuss the differences between the stemmers.
Explain when one stemmer might be more appropriate than the others.

In [55]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jeng\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [56]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")


In [57]:
text = (
    "The maintenance engineers were inspecting the aging compressors and pumps, "
    "noticing that several valves and gaskets were failing. They quickly started "
    "replacing worn-out components and performing system overhauls to ensure safe operations."
)


In [58]:
words = word_tokenize(text)

porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]
snowball_stems = [snowball.stem(word) for word in words]

print("Original Words:", words)
print("\nPorter Stemmer:", porter_stems)
print("\nLancaster Stemmer:", lancaster_stems)
print("\nSnowball Stemmer:", snowball_stems)


Original Words: ['The', 'maintenance', 'engineers', 'were', 'inspecting', 'the', 'aging', 'compressors', 'and', 'pumps', ',', 'noticing', 'that', 'several', 'valves', 'and', 'gaskets', 'were', 'failing', '.', 'They', 'quickly', 'started', 'replacing', 'worn-out', 'components', 'and', 'performing', 'system', 'overhauls', 'to', 'ensure', 'safe', 'operations', '.']

Porter Stemmer: ['the', 'mainten', 'engin', 'were', 'inspect', 'the', 'age', 'compressor', 'and', 'pump', ',', 'notic', 'that', 'sever', 'valv', 'and', 'gasket', 'were', 'fail', '.', 'they', 'quickli', 'start', 'replac', 'worn-out', 'compon', 'and', 'perform', 'system', 'overhaul', 'to', 'ensur', 'safe', 'oper', '.']

Lancaster Stemmer: ['the', 'maint', 'engin', 'wer', 'inspect', 'the', 'ag', 'compress', 'and', 'pump', ',', 'not', 'that', 'sev', 'valv', 'and', 'gasket', 'wer', 'fail', '.', 'they', 'quick', 'start', 'replac', 'worn-out', 'compon', 'and', 'perform', 'system', 'overha', 'to', 'ens', 'saf', 'op', '.']

Snowball St

#### Comparison of Stemmers

| Word         | **Porter**  | **Lancaster** | **Snowball**  |
|--------------|-------------|---------------|---------------|
| maintenance  | mainten     | maint         | mainten       |
| engineers    | engin       | engin         | engin         |
| inspecting   | inspect     | inspect       | inspect       |
| compressors  | compressor  | compress      | compressor    |
| pumps        | pump        | pump          | pump          |
| valves       | valv        | valv          | valv          |
| gaskets      | gasket      | gasket        | gasket        |
| failing      | fail        | fail          | fail          |
| replacing    | replac      | replac        | replac        |
| components   | compon      | compon        | compon        |
| overhauls    | overhaul    | overha        | overhaul      |
| ensure       | ensur       | ens           | ensur         |
| operations   | oper        | op            | oper          |


## Analysis

The table above compares the outputs of three stemmers (Porter, Lancaster, and Snowball) when applied to maintenance-related text. Here are the key observations:

- **Porter Stemmer**:  
  - Output: "maintenance" → "mainten", "engineers" → "engin".  
  - It provides a moderate level of stemming. Stems are truncated forms rather than dictionary words ("quickly" → "quickli", "valves" → "valv"), which can cost nuance for technical terms.

- **Lancaster Stemmer**:  
  - Output: "maintenance" → "maint", "engineers" → "engin".  
  - This stemmer is the most aggressive, producing very short stems such as "operations" → "op", "ensure" → "ens", and "several" → "sev". It can also collapse a word onto an unrelated one ("noticing" → "not"). Aggressive stemming can help when maximum conflation is wanted, but it strips too much from technical terms and compromises their interpretability in a maintenance context.

- **Snowball Stemmer**:  
  - Output: "maintenance" → "mainten", "engineers" → "engin".  
  - It offers a balanced approach, often yielding results similar to Porter (the English Snowball stemmer is a revision of the Porter algorithm). However, even Snowball may not fully capture the semantic details needed in technical maintenance documents.
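
To put a rough number on "more aggressive", the sketch below compares average stem length for Porter and Lancaster. The stems are copied by hand from the notebook output above for the thirteen words in the comparison table, so no stemmer is run here. Snowball is left out because its printed output is cut off. The names `observed` and `mean_stem_length` are illustrative.

```python
# word -> (porter_stem, lancaster_stem), copied from the printed outputs above
observed = {
    "maintenance": ("mainten", "maint"),
    "engineers": ("engin", "engin"),
    "inspecting": ("inspect", "inspect"),
    "compressors": ("compressor", "compress"),
    "pumps": ("pump", "pump"),
    "valves": ("valv", "valv"),
    "gaskets": ("gasket", "gasket"),
    "failing": ("fail", "fail"),
    "replacing": ("replac", "replac"),
    "components": ("compon", "compon"),
    "overhauls": ("overhaul", "overha"),
    "ensure": ("ensur", "ens"),
    "operations": ("oper", "op"),
}


def mean_stem_length(stems):
    """Average number of characters per stem."""
    return sum(len(stem) for stem in stems) / len(stems)


porter_mean = mean_stem_length([p for p, _ in observed.values()])
lancaster_mean = mean_stem_length([l for _, l in observed.values()])

# Words on which the two stemmers disagree, in table order
disagreements = [word for word, (p, l) in observed.items() if p != l]

print(f"Porter mean stem length:    {porter_mean:.2f}")     # 5.85
print(f"Lancaster mean stem length: {lancaster_mean:.2f}")  # 5.08
print("Disagreements:", disagreements)
```

On this small sample the two stemmers agree on eight of thirteen words, and every disagreement is a case where Lancaster cuts further. Thirteen words is far too few to generalize from, so this is a sanity check on the table rather than an evaluation.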

### Domain-Specific Considerations

- **Technical Maintenance Context**:  
  The outputs indicate that generic stemmers are not perfectly tuned for the maintenance domain. Key technical terms may lose important semantic information (e.g., "maintenance" becoming "mainten"), which could be problematic for downstream tasks like document retrieval, classification, or technical report analysis.

- **Need for Domain Expertise**:  
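  Given the nuances in technical language, a domain-specific stemmer or further customization of existing algorithms might be necessary. Because all three stemmers are rule-based, customization here means adjusting suffix rules or exempting a list of protected maintenance terms from stemming. Choosing those terms by reviewing maintenance-related corpora could help preserve critical details while still normalizing the text effectively.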
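
One lightweight form of customization is to wrap an existing stemmer so that a list of protected domain terms passes through unchanged. The sketch below is a minimal illustration, not a recommended implementation: `make_protected_stemmer` is a hypothetical helper, the protected terms are examples only, and `toy_stem` is a deliberately crude stand-in so the block runs without NLTK.

```python
def make_protected_stemmer(stem, protected):
    """Wrap `stem` so that words in `protected` are returned unstemmed."""
    protected = {word.lower() for word in protected}

    def stem_word(word):
        lowered = word.lower()
        # Protected domain terms bypass the underlying stemmer entirely
        return lowered if lowered in protected else stem(lowered)

    return stem_word


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Example protected terms; a real list would come from domain review
stem_word = make_protected_stemmer(toy_stem, {"maintenance", "bearings"})

print([stem_word(w) for w in ["Inspecting", "bearings", "pumps", "maintenance"]])
# ['inspect', 'bearings', 'pump', 'maintenance']
```

Any callable that maps a word to its stem can be passed as `stem`, so with NLTK installed `PorterStemmer().stem` could replace `toy_stem`. The trade-off is that the protected list has to be maintained by hand.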

### Conclusion

While Porter, Lancaster, and Snowball stemmers each have their strengths for general NLP tasks, their performance in the technical maintenance domain is limited. For applications requiring precise interpretation of technical terms, developing a domain-specific stemming model or customizing current stemmers is recommended.

