In [None]:
# This notebook practices the key text preprocessing techniques covered in this chapter,
# including stop word removal, stemming, lemmatization, regular expressions, and tokenization.
# Each exercise focuses on a specific preprocessing technique:
# 1.  **Stop Word Removal**: Eliminating common words (like 'the', 'is', 'a') that typically do not carry significant meaning.
# 2.  **Stemming**: Reducing words to their root or base form (stem) by chopping off suffixes.
# 3.  **Lemmatization**: Reducing words to their dictionary or lemma form, which is a linguistically valid base form.
# 4.  **Tokenization**: Breaking down text into smaller units called tokens (words or sentences).
# 5.  **Part-of-Speech (POS) Tagging**: Assigning grammatical categories (e.g., noun, verb, adjective) to each word.
# 6.  **Frequency Distribution**: Counting the occurrences of each unique word or token in a given text.

## NLP Preprocessing Techniques

Preprocessing is a crucial step in Natural Language Processing (NLP) workflows. It involves cleaning and transforming raw text data into a format that is more suitable for machine learning models and analytical tasks. Each technique plays a vital role in enhancing the quality and efficiency of NLP applications:

*   **Stop Word Removal:**
    *   **Importance:** Reduces noise in text data by eliminating common words that typically don't carry significant meaning (e.g., 'the', 'is', 'a'). This helps in focusing on more important terms for analysis, reduces the dimensionality of the data, and improves the performance of algorithms by removing irrelevant features.

*   **Stemming:**
    *   **Importance:** Groups inflected forms of the same word under a common stem (e.g., 'connect', 'connecting', 'connected' all reduce to 'connect'). This reduces the vocabulary size, standardizes words, and can improve recall in information retrieval systems by matching different forms of a word to a single root. Because stemmers only strip suffixes, the resulting stem is not always a dictionary word (e.g., 'reducing' becomes 'reduc').

*   **Lemmatization:**
    *   **Importance:** Similar to stemming, it reduces words to their base dictionary form (lemma), but ensures the resulting word is a linguistically valid term. This provides a more accurate representation of words, which is crucial for applications requiring high linguistic precision, such as machine translation or question-answering systems.

*   **Tokenization:**
    *   **Importance:** Breaks down raw text into fundamental units (tokens such as words or sentences). This is typically one of the first steps in an NLP task, as it converts unstructured text into structured elements that can be further processed and analyzed by algorithms.

*   **Part-of-Speech (POS) Tagging:**
    *   **Importance:** Assigns a grammatical category to each word (e.g., noun, verb, adjective). This information helps reveal the syntactic structure of a sentence and disambiguate words whose role depends on context (e.g., 'book' as a noun in "a good book" vs. a verb in "book a flight"), and it is often a prerequisite for more advanced NLP tasks like named entity recognition and parsing.

*   **Frequency Distribution:**
    *   **Importance:** Provides insights into the most common words or tokens in a corpus. This helps in identifying key themes, prevalent topics, and important terms within the text, which is useful for feature selection, keyword extraction, and understanding the overall composition of the data.

### Exercise 1: Stop Word Removal

**Task:** Use the `nltk` library to remove stop words from the following text: "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."

**Solution:**

In [2]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Sample text
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."

# Tokenize the text
tokens = text.split()

# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:")
print(tokens)
print("\nFiltered Tokens:")
print(filtered_tokens)

Original Tokens:
['NLP', 'enables', 'computers', 'to', 'understand', 'human', 'language,', 'which', 'is', 'a', 'crucial', 'aspect', 'of', 'artificial', 'intelligence.']

Filtered Tokens:
['NLP', 'enables', 'computers', 'understand', 'human', 'language,', 'crucial', 'aspect', 'artificial', 'intelligence.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Challenge 1: Stop Word Removal with Punctuation

**Task:** Modify the stop word removal exercise to also remove punctuation from the text before filtering stop words.

**Hint:** You might need the `string` module with `str.translate`, or the `re` module, for punctuation removal.
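
One possible approach to this challenge is sketched below. To keep it runnable without downloading NLTK data, it uses a tiny hand-written stop word set (`STOP_WORDS`, a hypothetical stand-in that is not part of the exercise); in the notebook you would pass `set(stopwords.words('english'))` instead. The helper name `remove_punct_and_stop_words` is also just an illustration.

```python
import string

# Hypothetical, deliberately small stop word set so the sketch needs no downloads.
# In the notebook, use set(stopwords.words('english')) instead.
STOP_WORDS = {"to", "which", "is", "a", "of", "the", "and", "in"}


def remove_punct_and_stop_words(text, stop_words=STOP_WORDS):
    """Strip ASCII punctuation, split on whitespace, then drop stop words."""
    # Translation table that deletes every character in string.punctuation
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).split()
    return [tok for tok in tokens if tok.lower() not in stop_words]


text = ("NLP enables computers to understand human language, "
        "which is a crucial aspect of artificial intelligence.")
filtered = remove_punct_and_stop_words(text)
print(filtered)
```

Removing punctuation first means tokens such as `language,` and `intelligence.` are cleaned before the stop word comparison, so a stop word followed by a comma would also be filtered.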

### Exercise 2: Stemming

**Task:** Use the `nltk` library to perform stemming on the following text: "Stemming helps in reducing words to their root form, which can be beneficial for text processing."

**Solution:**

In [5]:
import nltk
from nltk.stem import PorterStemmer
import string

# Sample text
text = "Stemming helps in reducing words to their root form, which can be beneficial for text processing."

# Tokenize the text
tokens = text.split()

# Initialize the stemmer
stemmer = PorterStemmer()

# Stem the tokens
stemmed_tokens = [stemmer.stem(word) for word in tokens]

# Remove punctuation from the stemmed tokens.
# Note: tokens such as 'processing.' reach the stemmer with punctuation still
# attached, so their suffixes are not stripped; removing punctuation before
# stemming would avoid this.
stemmed_tokens = [word.translate(str.maketrans('', '', string.punctuation)) for word in stemmed_tokens]

print("Original Tokens:")
print(tokens)
print("\nStemmed Tokens:")
print(stemmed_tokens)

Original Tokens:
['Stemming', 'helps', 'in', 'reducing', 'words', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'beneficial', 'for', 'text', 'processing.']

Stemmed Tokens:
['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root', 'form', 'which', 'can', 'be', 'benefici', 'for', 'text', 'processing']


### Challenge 2: Compare Stemmers

**Task:** Implement stemming using another stemmer from `nltk.stem` (e.g., `SnowballStemmer` or `LancasterStemmer`) on the same text. Compare the results with the `PorterStemmer` output and note any differences.

**Hint:** Remember to import the new stemmer.

In [6]:
# using snowball stemmer
from nltk.stem import SnowballStemmer

text = "Stemming helps in reducing words to their root form, which can be beneficial for text processing."
tokens = text.split()
snowball_stemmer = SnowballStemmer("english")
snowball_stemmed_tokens = [snowball_stemmer.stem(word) for word in tokens]
print("\nSnowball Stemmed Tokens:")
print(snowball_stemmed_tokens)


Snowball Stemmed Tokens:
['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'benefici', 'for', 'text', 'processing.']
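
To make the comparison the challenge asks for concrete, the sketch below lines up the two printed outputs from this notebook (copied verbatim from the Porter and Snowball cells) and lists the positions where they differ. It is plain Python and does not re-run either stemmer.

```python
# Outputs copied from the PorterStemmer and SnowballStemmer cells above
porter = ['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root',
          'form', 'which', 'can', 'be', 'benefici', 'for', 'text', 'processing']
snowball = ['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root',
            'form,', 'which', 'can', 'be', 'benefici', 'for', 'text', 'processing.']

# Collect (position, porter_token, snowball_token) wherever the lists disagree
differences = [(i, p, s)
               for i, (p, s) in enumerate(zip(porter, snowball))
               if p != s]

for i, p, s in differences:
    print(f"token {i}: Porter={p!r}  Snowball={s!r}")
```

On this sentence the only differences are at the tokens that carried punctuation, which suggests they come from the punctuation-stripping step in the Porter cell rather than from the stemming algorithms themselves. A fairer comparison would apply the same punctuation handling to both.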


### Exercise 3: Lemmatization

**Task:** Use the `nltk` library to perform lemmatization on the following text: "Lemmatization aims to reduce words to their base or dictionary form, which is useful for linguistic analysis."

**Solution:**

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary data for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng') # Added to resolve LookupError for averaged_perceptron_tagger_eng

# Sample text
text = "Lemmatization aims to reduce words to their base or dictionary form, which is useful for linguistic analysis."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN # Default to noun if no clear tag

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

# Lemmatize the tokens with POS tags
lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]

print("Original Tokens:")
print(tokens)
print("\nLemmatized Tokens:")
print(lemmatized_tokens)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


Original Tokens:
['Lemmatization', 'aims', 'to', 'reduce', 'words', 'to', 'their', 'base', 'or', 'dictionary', 'form', ',', 'which', 'is', 'useful', 'for', 'linguistic', 'analysis', '.']

Lemmatized Tokens:
['Lemmatization', 'aim', 'to', 'reduce', 'word', 'to', 'their', 'base', 'or', 'dictionary', 'form', ',', 'which', 'be', 'useful', 'for', 'linguistic', 'analysis', '.']


### Challenge 3: Lemmatize Different POS Tags

**Task:** Extend the lemmatization exercise by choosing one word that can be a noun and a verb (e.g., 'runs', 'running') and demonstrate how lemmatization changes based on the assigned POS tag.

**Hint:** You'll need to explicitly pass the correct WordNet POS tag (`wordnet.NOUN` or `wordnet.VERB`) to the `lemmatize` function.

In [8]:
word = "running"
lemmatizer = WordNetLemmatizer()
lemma_as_noun = lemmatizer.lemmatize(word, wordnet.NOUN)
lemma_as_verb = lemmatizer.lemmatize(word, wordnet.VERB)    
print(f"\nWord: '{word}'")
print(f"Lemma as Noun: '{lemma_as_noun}'")
print(f"Lemma as Verb: '{lemma_as_verb}'")



Word: 'running'
Lemma as Noun: 'running'
Lemma as Verb: 'run'


### Exercise 4: Tokenization

**Task:** Use the `nltk` library to perform word and sentence tokenization on the following paragraph: "Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal is to make it easier for computers to process natural language."

**Solution:**

In [9]:
import nltk

# Download necessary data for tokenization
nltk.download('punkt')
nltk.download('punkt_tab') # Added to resolve LookupError for punkt_tab

# Sample text
text = "Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal is to make it easier for computers to process natural language."

# Word tokenization
word_tokens = nltk.word_tokenize(text)

# Sentence tokenization
sentence_tokens = nltk.sent_tokenize(text)

print("Original Text:")
print(text)
print("\nWord Tokens:")
print(word_tokens)
print("\nSentence Tokens:")
print(sentence_tokens)

Original Text:
Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The goal is to make it easier for computers to process natural language.

Word Tokens:
['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'a', 'stream', 'of', 'text', 'into', 'words', ',', 'phrases', ',', 'symbols', ',', 'or', 'other', 'meaningful', 'elements', 'called', 'tokens', '.', 'The', 'goal', 'is', 'to', 'make', 'it', 'easier', 'for', 'computers', 'to', 'process', 'natural', 'language', '.']

Sentence Tokens:
['Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens.', 'The goal is to make it easier for computers to process natural language.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Challenge 4: Regex Tokenization

**Task:** Instead of `nltk.word_tokenize`, use `nltk.regexp_tokenize` to tokenize the sample text. Create a regular expression that captures words (alphanumeric sequences) and punctuation marks as separate tokens.

**Hint:** A pattern like `r'\w+|\S'` might be a good starting point.
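
As a quick way to see what the hinted pattern does, the sketch below applies it with the standard library's `re.findall`. This is a stand-in chosen so the example runs without NLTK; the challenge itself asks for `nltk.regexp_tokenize(text, pattern)`, which is expected to behave the same way for a simple pattern like this one.

```python
import re

text = ("Tokenization is the process of breaking a stream of text into words, "
        "phrases, symbols, or other meaningful elements called tokens. "
        "The goal is to make it easier for computers to process natural language.")

# \w+ matches a run of word characters (letters, digits, underscore);
# \S matches any single remaining non-whitespace character, i.e. punctuation.
pattern = r"\w+|\S"
tokens = re.findall(pattern, text)

print(tokens)
print(len(tokens))
```

Because `\w+` is tried first, whole words are kept together and each punctuation mark becomes its own token. Contractions such as "don't" would be split into three tokens by this pattern, which differs from how `nltk.word_tokenize` treats them.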


### Exercise 5: Part-of-Speech (POS) Tagging

**Task:** Use the `nltk` library to perform Part-of-Speech (POS) tagging on the sentence: "The quick brown fox jumps over the lazy dog."

**Solution:**

In [None]:
import nltk

# Download necessary data for POS tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng') # Added to resolve LookupError for averaged_perceptron_tagger_eng

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence with the regex from Challenge 4:
# \w+ matches runs of word characters, \S matches any other non-space character
tokens = nltk.regexp_tokenize(sentence, r'\w+|\S')

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

print("Original Sentence:")
print(sentence)
print("\nTokens:")
print(tokens)
print("\nPOS Tags:")
print(pos_tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [None]:
import nltk

# Download necessary data for POS tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng') # Added to resolve LookupError for averaged_perceptron_tagger_eng

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)

print("Original Sentence:")
print(sentence)
print("\nTokens:")
print(tokens)
print("\nPOS Tags:")
print(pos_tags)

### Challenge 5: Detailed POS Tagging Analysis

**Task:** For the given sentence, not only display the POS tags but also explain what each tag means (e.g., 'DT' for determiner, 'JJ' for adjective, 'NN' for noun, 'VBZ' for verb, 3rd person singular present).

**Hint:** You can find a list of NLTK POS tags and their meanings in the NLTK documentation or by searching online.
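
A minimal sketch of one way to attach explanations to tags is shown below. The glossary `TAG_MEANINGS` covers only a handful of Penn Treebank tags, the helper name `explain_tags` is hypothetical, and the `sample` pairs are hand-written for illustration rather than actual tagger output. In the notebook you would pass the `pos_tags` list returned by `nltk.pos_tag` instead.

```python
# Partial glossary of Penn Treebank tags; extend it with the tags you encounter
TAG_MEANINGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
}


def explain_tags(tagged_tokens):
    """Return (word, tag, meaning) triples for a list of (word, tag) pairs."""
    return [(word, tag, TAG_MEANINGS.get(tag, "not in this glossary"))
            for word, tag in tagged_tokens]


# Hand-written illustrative pairs, NOT real output from nltk.pos_tag
sample = [("The", "DT"), ("lazy", "JJ"), ("dog", "NN"),
          ("jumps", "VBZ"), ("over", "IN")]

for word, tag, meaning in explain_tags(sample):
    print(f"{word:<6} {tag:<4} {meaning}")
```

Using `dict.get` with a fallback keeps the loop from failing when the tagger emits a tag that the glossary does not cover yet.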

### Exercise 6: Frequency Distribution

**Task:** Calculate the frequency distribution of words in the following text, and display the 5 most common words: "Natural language processing (NLP) is a field of artificial intelligence, machine learning, and deep learning. NLP helps computers understand and process human language."

**Solution:**

In [None]:
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords

# Download necessary data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab') # Added to resolve LookupError for punkt_tab

# Sample text
text = "Natural language processing (NLP) is a field of artificial intelligence, machine learning, and deep learning. NLP helps computers understand and process human language."

# Tokenize the text
tokens = nltk.word_tokenize(text.lower()) # Convert to lowercase for consistent counting

# Remove stop words and punctuation (optional but good for freq dist)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Calculate frequency distribution
fdist = FreqDist(filtered_tokens)

print("Original Text:")
print(text)
print("\nFiltered Tokens for Frequency Distribution:")
print(filtered_tokens)
print("\nMost Common 5 Words:")
print(fdist.most_common(5))

### Challenge 6: Frequency Distribution with Bigrams

**Task:** Instead of individual word frequency, calculate the frequency distribution of bigrams (sequences of two words) in the sample text. Display the 5 most common bigrams.

**Hint:** You can use `nltk.bigrams` to generate bigrams from your filtered tokens.
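
The idea behind the challenge can be sketched with the standard library alone. Here `zip(tokens, tokens[1:])` stands in for `nltk.bigrams`, `collections.Counter` stands in for `FreqDist`, and the small `STOP_WORDS` set is a hypothetical substitute for NLTK's English stop word list so that the example runs without downloads.

```python
import re
from collections import Counter

# Hypothetical minimal stop word set; the notebook uses stopwords.words('english')
STOP_WORDS = {"is", "a", "of", "and"}

text = ("Natural language processing (NLP) is a field of artificial intelligence, "
        "machine learning, and deep learning. NLP helps computers understand and "
        "process human language.")

# Lowercase, keep alphanumeric runs only, then drop stop words
tokens = [tok for tok in re.findall(r"[a-z0-9]+", text.lower())
          if tok not in STOP_WORDS]

# Pair each token with its successor to form bigrams
bigram_list = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigram_list)

print(bigram_counts.most_common(5))
```

With a text this short, every bigram occurs exactly once, so the "top 5" is not very informative; a longer corpus is needed before bigram frequencies reveal recurring phrases. Note also that pairing tokens after filtering creates bigrams that span removed stop words and sentence boundaries, such as `('learning', 'nlp')`.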