# NLP Assignment 1
Author: Vincent Itucal

## Assignment --> Use Case 1: Removing Subsequent Occurrences of Words

Removing subsequent occurrences of words (also known as deduplication of adjacent duplicate words) is a common preprocessing step in NLP. This task is important because: 
1. Repeated words can distort text analysis, especially in tasks like text summarization, sentiment analysis, and language modeling.
2. Removing redundant words improves the readability of the text, making it more coherent.
3. Reducing noise in the text data can improve the performance of machine learning models.

Input:
A single string text that may contain multiple sentences and words. Words are separated by spaces.

Output:
A single string with subsequent duplicate words removed.

Constraints:
The input string can be empty.
The words are case-sensitive, meaning "Word" and "word" are considered different.

In [1]:
import nltk
import spacy
# nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\vsitu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\vsitu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
def remove_adjacent_duplicates_nltk(text):
    if not text:
        return ""  # Return empty string if input is empty

    sentences = sent_tokenize(text)  # Split text into sentences
    detokenizer = TreebankWordDetokenizer()
    cleaned_sentences = []
    prev_sentence = None

    for sentence in sentences:
        words = word_tokenize(sentence)  # Tokenize sentence into words
        if not words:
            continue

        # Remove adjacent duplicate words (Case-Sensitive)
        filtered_words = [words[0]]
        for i in range(1, len(words)):
            if words[i] != words[i - 1]:  # Case-sensitive comparison
                filtered_words.append(words[i])

        cleaned_sentence = detokenizer.detokenize(filtered_words)  # Proper spacing

        # Avoid adding duplicate adjacent sentences (Case-Sensitive)
        if cleaned_sentence != prev_sentence:
            cleaned_sentences.append(cleaned_sentence)
            prev_sentence = cleaned_sentence  # Keep case-sensitive tracking

    return " ".join(cleaned_sentences)  # Proper spacing between sentences

In [3]:
text = "Hello world, World. Hello world, world. This is is a test test!  This is is a test test!"
print(remove_adjacent_duplicates_nltk(text))

Hello world, World. Hello world, world. This is a test!
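
The input specification above says that words are separated by spaces. Under that assumption alone, the core idea can be sketched without NLTK. This is a minimal illustration, not a replacement for the function above: the name `remove_adjacent_duplicates_simple` is hypothetical, and it neither separates punctuation from words nor drops repeated sentences.

```python
from itertools import groupby


def remove_adjacent_duplicates_simple(text: str) -> str:
    """Collapse runs of identical, whitespace-separated words (case-sensitive)."""
    # str.split() with no arguments splits on any run of whitespace and
    # returns [] for an empty string, so empty input yields "".
    words = text.split()
    # groupby groups consecutive equal items; keep one key per run.
    return " ".join(word for word, _run in groupby(words))


print(remove_adjacent_duplicates_simple("This is is a test test test"))
# prints: This is a test
```

Because this sketch compares raw whitespace-separated tokens, `"test test!"` is left unchanged (`"test"` and `"test!"` differ), which is one reason the NLTK version tokenizes punctuation separately.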


## Assignment --> Use Case 2: Adding Custom Stop Words to `nltk` and `spacy`


Adding custom stop words is a crucial preprocessing step in NLP. This task is important because:

1. Customizing stop words allows for more flexible and relevant text cleaning tailored to specific use cases.
2. Adding domain-specific stop words improves the performance of text analysis and machine learning models by removing irrelevant terms.
3. Eliminating non-essential words improves the readability and coherence of the text.

Objective:
Extend the default stop words list in both `nltk` and `spacy` by adding custom stop words.

Input:
A list of custom stop words to be added to the existing stop words list in `nltk` and `spacy`.

Output:
A function that takes a string and returns the text with both default and custom stop words removed.

Constraints:
The input string can be empty.
The words are case-sensitive, meaning "Word" and "word" are considered different.

Instructions:
Add custom stop words to `nltk`'s default stop words list.
Add custom stop words to `spacy`'s default stop words list.
Remove stop words from a given text using the updated stop words list for both `nltk` and `spacy`.

**Note: Please ensure that the custom stop words you add are unique to your implementation. When testing and checking your notebooks, I will include these specific words to ensure they have been correctly added to your stop words list.**

Custom Stop Words to Use:    
"customword1";  
"customword2";  
"customword3" 

### Add custom words here

In [4]:
custom_stop_words = {"cat", "dog", "sheep"}

In [5]:
# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Extend NLTK stop words
nltk_stop_words = set(stopwords.words("english")).union(custom_stop_words)

# Extend spaCy stop words
for word in custom_stop_words:
    nlp.Defaults.stop_words.add(word)
    nlp.vocab[word].is_stop = True

def remove_stop_words_nltk(text: str) -> list:
    """Return the tokens of the given text that are not NLTK or custom stop words."""
    if not text.strip():  # Handle empty or whitespace-only input
        return []

    # Tokenize using NLTK
    words = word_tokenize(text)

    # Lowercase each token before the lookup, because NLTK's stop word list is lowercase
    filtered_words = [word for word in words if word.lower() not in nltk_stop_words]
    return filtered_words

def remove_stop_words_spacy(text: str) -> list:
    """Return the tokens of the given text that spaCy does not flag as stop words."""
    if not text.strip():  # Handle empty or whitespace-only input
        return []

    # Process text with spaCy
    doc = nlp(text)

    # Keep only the tokens that are not flagged as stop words
    filtered_words = [token.text for token in doc if not token.is_stop]

    return filtered_words

### nltk version test

In [6]:
# Example usage
text = "This is a sample text with cat, dog, sheep and other words."
cleaned_text = remove_stop_words_nltk(text)
print(cleaned_text)

['sample', 'text', ',', ',', 'words', '.']


### spacy version test

In [7]:
cleaned_text = remove_stop_words_spacy(text)
print(cleaned_text)

['sample', 'text', ',', ',', 'words', '.']
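
The constraints above treat "Word" and "word" as different words, whereas the NLTK function above lowercases each token before the lookup. A minimal sketch of strictly case-sensitive filtering follows. It assumes whitespace tokenization and uses a small stand-in base list (`BASE_STOP_WORDS` is hypothetical, not NLTK's or spaCy's actual list) so that it runs without any downloads.

```python
# Hypothetical stand-in for a library's default list; not the real NLTK or spaCy list.
BASE_STOP_WORDS = {"this", "is", "a", "with", "and", "other"}
CUSTOM_STOP_WORDS = {"cat", "dog", "sheep"}


def remove_stop_words_case_sensitive(text: str, stop_words: set) -> str:
    """Drop whitespace-separated tokens that exactly match a stop word."""
    # No lowercasing: "Cat" is kept even though "cat" is a stop word.
    return " ".join(word for word in text.split() if word not in stop_words)


stop_words = BASE_STOP_WORDS | CUSTOM_STOP_WORDS
print(remove_stop_words_case_sensitive("This is a sample text with Cat and dog", stop_words))
# prints: This sample text Cat
```

Whether exact matching or lowercased matching is preferable depends on how the constraint is read; with exact matching, capitalized words such as a sentence-initial "This" survive the filter.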


## Assignment --> Use Case 3: `nltk` Stemming

Objective:

Understand and compare the stemming techniques. Determine when each stemming technique is appropriate to use based on the context and requirements.

Instructions:

Apply stemming using `PorterStemmer`, `LancasterStemmer`, and `SnowballStemmer`.

Compare the results and analyze the differences.

Write code to demonstrate the stemming process for each stemmer.
Provide example text and show the output of each stemming process.

Analysis:

Discuss the differences between the stemmers.
Explain when one stemmer might be more appropriate than the others.

In [8]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Example text
text = ["running", "jumps", "happily", "flies", "better", "swimming", "flying"]

# Initialize the stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer("english")

# Apply stemming using each stemmer
porter_stemmed = [porter_stemmer.stem(word) for word in text]
lancaster_stemmed = [lancaster_stemmer.stem(word) for word in text]
snowball_stemmed = [snowball_stemmer.stem(word) for word in text]

# Print the results
print("Original text:", text)
print("Porter Stemmer:", porter_stemmed)
print("Lancaster Stemmer:", lancaster_stemmed)
print("Snowball Stemmer:", snowball_stemmed)

Original text: ['running', 'jumps', 'happily', 'flies', 'better', 'swimming', 'flying']
Porter Stemmer: ['run', 'jump', 'happili', 'fli', 'better', 'swim', 'fli']
Lancaster Stemmer: ['run', 'jump', 'happy', 'fli', 'bet', 'swim', 'fly']
Snowball Stemmer: ['run', 'jump', 'happili', 'fli', 'better', 'swim', 'fli']
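
To make the differences easier to see, the outputs printed above can be compared word by word. The sketch below copies those lists verbatim rather than re-running the stemmers; the helper name `disagreements` is hypothetical.

```python
words = ['running', 'jumps', 'happily', 'flies', 'better', 'swimming', 'flying']
porter = ['run', 'jump', 'happili', 'fli', 'better', 'swim', 'fli']
lancaster = ['run', 'jump', 'happy', 'fli', 'bet', 'swim', 'fly']
snowball = ['run', 'jump', 'happili', 'fli', 'better', 'swim', 'fli']


def disagreements(words, porter, lancaster, snowball):
    """Return (word, porter, lancaster, snowball) rows where the stems are not all equal."""
    return [
        (w, p, l, s)
        for w, p, l, s in zip(words, porter, lancaster, snowball)
        if len({p, l, s}) > 1
    ]


for w, p, l, s in disagreements(words, porter, lancaster, snowball):
    print(f"{w}: porter={p}, lancaster={l}, snowball={s}")
# prints:
# happily: porter=happili, lancaster=happy, snowball=happili
# better: porter=better, lancaster=bet, snowball=better
# flying: porter=fli, lancaster=fly, snowball=fli
```

On this sample, Porter and Snowball agree on every word, and Lancaster differs on three of the seven.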


- <b>Porter Stemmer</b>
    - This stemmer classifies each character in a token as a consonant (c) or a vowel (v), then collapses each run of consecutive consonants into C and each run of consecutive vowels into V, so that every word is represented as a sequence of consonant and vowel groups. It then applies an ordered list of rules that remove or replace suffixes, with each rule conditioned on the number of vowel-consonant (VC) groups in the remaining stem.<sup>[1]</sup>
    - Foundation for subsequent algorithms
    - Only supports the English language
    - Use when you need a simple, widely used, and relatively accurate stemmer.
- <b>Snowball Stemmer</b>
    - The Snowball stemmer is an updated version of the Porter stemmer. While it aims to enforce a more robust set of rules for determining suffix removal, it remains prone to many of the same errors.
    - The English-language Snowball stemmer is also called Porter2.
    - Supports multiple languages, including English, Russian, Danish, French, Finnish, German, Italian, Hungarian, Portuguese, Norwegian, Swedish, and Spanish.
    - Use when working with multiple languages or needing a balance between Porter’s conservatism and Lancaster’s aggressiveness.
- <b>Lancaster Stemmer</b>
    - Also known as the Paice stemmer, it is the most aggressive of the English-language stemmers.
    - The Lancaster stemmer contains a list of over 100 rules that dictate which ending character strings, if present, to replace with other strings, if any. The stemmer iterates through each word token, checking it against all the rules. If the token’s ending string matches that of a rule, the algorithm applies the rule’s operation and then runs the transformed word through all of the rules again. This repeats until the token passes all of the rules without being transformed.<sup>[2]</sup>
    - Only supports the English language
    - Use when you need aggressive stemming but can tolerate some over-stemming.
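
The consonant/vowel grouping described for the Porter stemmer can be sketched in a few lines. This is an illustration of the grouping and the VC count only, not NLTK's implementation. The function names are hypothetical, input is assumed to be lowercase, and "y" is treated as a vowel only when it follows a consonant, as in Porter's paper.<sup>[1]</sup>

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] counts as a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Collapse runs of consonants to 'C' and runs of vowels to 'V'."""
    pattern = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
    return "".join(pattern)


def measure(word: str) -> int:
    """Number of vowel-consonant (VC) groups in the collapsed pattern."""
    return cv_pattern(word).count("VC")


for w in ["tree", "trouble", "troubles"]:
    print(w, cv_pattern(w), measure(w))
# prints:
# tree CV 0
# trouble CVCV 1
# troubles CVCVC 2
```

In the full algorithm, this count is computed on the stem that would remain after removing a suffix, and a rule fires only if the count meets that rule's threshold.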

## References

[1] Martin Porter, "An algorithm for suffix stripping", Program: electronic library and information systems, Vol. 14, No. 3, 1980, pp. 130-137, https://www.emerald.com/insight/content/doi/10.1108/eb046814/full/html 

[2] Chris Paice, "Another stemmer", ACM SIGIR Forum, Vol. 24, No. 3, 1990, pp. 56-61, https://dl.acm.org/doi/10.1145/101306.101310

[3] https://www.ibm.com/think/topics/stemming#:~:text=The%20Snowball%20stemmer%20differs%20from,%2C%20French%2C%20and%20even%20Russian.

[4] https://towardsai.net/p/l/stemming-porter-vs-snowball-vs-lancaster