Task 1: Individual Exercise: Preprocess Your Own Text
1. Collect Data: Copy a short paragraph (3-5 sentences) from a website or social media.
2. Tokenize: Split the text into individual words.
3. Remove Stop Words: Filter out common words like "the," "is," etc.
4. Stem: Reduce words to their root forms.
5. Lemmatize: Use POS tagging for context-aware lemmatization.
6. Submit Results: Share your paragraph, final tokens, and a brief reflection.
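Steps 2 and 3 can be tried out before any NLTK resources are installed. The sketch below is only an approximation that uses the standard library: `simple_tokenize`, `remove_stop_words`, and the tiny `STOP_WORDS` set are hypothetical names invented for illustration. The regular expression also keeps hyphenated words and contractions in one piece, which is an assumption of this sketch and not necessarily how `word_tokenize` behaves.

```python
import re

# Hypothetical, deliberately tiny stop word set; NLTK's English list is far longer.
STOP_WORDS = {"the", "is", "a", "in", "to", "its", "are", "from"}

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in STOP_WORDS."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

paragraph = "The sun rises in the east."
tokens = simple_tokenize(paragraph)
filtered = remove_stop_words(tokens)
print("Tokens:", tokens)
print("After stop word removal:", filtered)
```

Comparing the two printed lists shows which words the filter treats as noise. Swapping in your own paragraph is a quick way to check whether the stop word set is too aggressive for your text.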

In [3]:
# Import necessary libraries
from nltk.tokenize import word_tokenize # For splitting text into tokens
from nltk.corpus import stopwords # For stop word removal
from nltk.stem import PorterStemmer # For stemming words
from nltk.stem import WordNetLemmatizer # For lemmatizing words
from nltk import pos_tag, download # For POS tagging and resource downloads
from nltk.corpus import wordnet # For WordNet integration (lemmatization)
import re # For special character removal
# Download necessary datasets if not already downloaded
download('wordnet') # For lemmatization support
download('averaged_perceptron_tagger') # For POS tagging (newer NLTK releases look for 'averaged_perceptron_tagger_eng' instead)
# Sample text examples (students can replace these with their own)
texts = [
 "AI is transforming industries rapidly! From healthcare to education, its applications are endless.",
 "The sun rises in the east, setting a golden glow over thehorizon.",
 "John's cat, which is black and white, loves to play with its toys.",
 "Weather forecasts have improved dramatically due to AI-poweredmodels."
]
# Iterate over each example text
for idx, text in enumerate(texts):
    print(f"\n**Example {idx + 1}: Original Text**")
    print(text)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Birendra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Birendra\AppData\Roaming\nltk_data...



**Example 1: Original Text**
AI is transforming industries rapidly! From healthcare to education, its applications are endless.

**Example 2: Original Text**
The sun rises in the east, setting a golden glow over thehorizon.

**Example 3: Original Text**
John's cat, which is black and white, loves to play with its toys.

**Example 4: Original Text**
Weather forecasts have improved dramatically due to AI-poweredmodels.


[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


In [5]:
# Step 1: Tokenization - `text` still holds the last example from the loop above
tokens = word_tokenize(text)
# Step 2a: Stop Word Removal - Removing common words that carry little meaning on their own
stop_words = set(stopwords.words('english')) # Load English stopwords
tokens_without_stopwords = [word for word in tokens if word.lower() not in stop_words]
print("\nTokens After Stop Word Removal:", tokens_without_stopwords)


Tokens After Stop Word Removal: ['Weather', 'forecasts', 'improved', 'dramatically', 'due', 'AI-poweredmodels', '.']


In [7]:
# Step 2b: Special Character Removal - Removing punctuation or special symbols
tokens_cleaned = [re.sub(r'[^\w\s]', '', word) for word in tokens_without_stopwords if re.sub(r'[^\w\s]', '', word)]
print("\nTokens After Special Character Removal:", tokens_cleaned)


Tokens After Special Character Removal: ['Weather', 'forecasts', 'improved', 'dramatically', 'due', 'AIpoweredmodels']
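The pattern `[^\w\s]` deletes characters and does not split on them, so a hyphenated token is fused into a single word, as the output above shows. The sketch below isolates that behavior on a few hand-written tokens and compares it with a variant that splits on hyphens first. Both `strip_special` and `split_then_strip` are hypothetical helper names, and the variant is just one possible alternative, not a recommendation from the exercise.

```python
import re

def strip_special(tokens):
    """Delete every non-word, non-space character; drop tokens left empty."""
    cleaned = (re.sub(r"[^\w\s]", "", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

def split_then_strip(tokens):
    """Variant: break tokens at hyphens first, then apply the same cleanup."""
    parts = [part for tok in tokens for part in tok.split("-")]
    return strip_special(parts)

# Hand-written tokens chosen for illustration.
sample = ["AI-powered", "models", ".", "John", "'s"]
fused = strip_special(sample)
separated = split_then_strip(sample)
print("Delete only:     ", fused)
print("Split on hyphens:", separated)
```

Which variant is preferable depends on whether a compound such as "AI-powered" should count as one term or two in the later stemming and lemmatization steps.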


In [8]:
# Step 3: Stemming - Stripping suffixes to get a crude stem (not always a dictionary word)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens_cleaned]
print("\nStemmed Tokens:", stemmed_tokens)


Stemmed Tokens: ['weather', 'forecast', 'improv', 'dramat', 'due', 'aipoweredmodel']
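Stems such as 'improv' and 'dramat' in the output above are not dictionary words, because the Porter stemmer strips suffixes by rule without consulting a vocabulary. The sketch below prints each word next to its stem to make that visible. It assumes NLTK is installed (the stemmer itself needs no downloaded data), and `stem_all` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_all(words):
    """Return (word, stem) pairs using NLTK's Porter stemmer."""
    return [(word, stemmer.stem(word)) for word in words]

# Words taken from the example text used in this notebook.
for word, stem in stem_all(["Weather", "forecasts", "improved", "dramatically"]):
    print(f"{word} -> {stem}")
```

Note that the stemmer also lowercases its input, which is why 'Weather' comes back as 'weather'.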


In [10]:
# Step 4: Lemmatization with POS Tagging - Context-aware word normalization
lemmatizer = WordNetLemmatizer()
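POS-aware lemmatization needs the tagger's Penn Treebank tags (for example 'VBN' or 'NNS') translated into the four categories WordNet understands. The sketch below shows only that translation step, so it runs without any NLTK data. `penn_to_wordnet` is a hypothetical helper name, and the tagged pairs are written by hand as a stand-in for what `pos_tag` might return on a full sentence; they are not real tagger output.

```python
# WordNet POS constants as plain strings; these match wordnet.ADJ, wordnet.NOUN,
# wordnet.VERB and wordnet.ADV from nltk.corpus.
ADJ, NOUN, VERB, ADV = "a", "n", "v", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    mapping = {"J": ADJ, "N": NOUN, "V": VERB, "R": ADV}
    return mapping.get(tag[:1].upper(), NOUN)

# Hand-written (word, tag) pairs used as placeholder input (an assumption).
tagged = [("forecasts", "NNS"), ("improved", "VBN"), ("dramatically", "RB")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting pair can then be passed to `lemmatizer.lemmatize(word, pos)`. Tagging the whole token list in one `pos_tag` call, and not word by word, gives the tagger the surrounding context it relies on.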

In [15]:
# Step 1: Tokenization - Splitting text into individual words
tokens = word_tokenize(text)
print("\nTokens:", tokens)

# Step 2a: Stop Word Removal - Removing common words that carry little meaning on their own
stop_words = set(stopwords.words('english')) # Load English stopwords
tokens_without_stopwords = [word for word in tokens if word.lower() not in stop_words]
print("\nTokens After Stop Word Removal:", tokens_without_stopwords)

# Step 2b: Special Character Removal - Removing punctuation or special symbols
tokens_cleaned = [re.sub(r'[^\w\s]', '', word) for word in tokens_without_stopwords if re.sub(r'[^\w\s]', '', word)]
print("\nTokens After Special Character Removal:", tokens_cleaned)

# Step 3: Stemming - Stripping suffixes to get a crude stem (not always a dictionary word)
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in tokens_cleaned]
print("\nStemmed Tokens:", stemmed_tokens)

# Step 4: Lemmatization with POS Tagging - Context-aware word normalization
lemmatizer = WordNetLemmatizer()

# Helper function to map a word's Penn Treebank POS tag to a WordNet POS constant
def get_wordnet_pos(word):
    # Note: tagging one word at a time gives the tagger no sentence context
    tag = pos_tag([word])[0][1][0].upper() # Get the first letter of the POS tag
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN) # Default to noun if tag is not found

# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens_cleaned]
print("\nLemmatized Tokens (with POS):", lemmatized_tokens)

# Final summary of every stage
print(f"\n**Final Results for Example {idx + 1}:**")
print("Original Tokens:", tokens)
print("Tokens After Stop Word Removal:", tokens_without_stopwords)
print("Tokens After Special Character Removal:", tokens_cleaned)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("\nTask Complete! Replace the example texts with your own for practice.")



Tokens: ['Weather', 'forecasts', 'have', 'improved', 'dramatically', 'due', 'to', 'AI-poweredmodels', '.']

Tokens After Stop Word Removal: ['Weather', 'forecasts', 'improved', 'dramatically', 'due', 'AI-poweredmodels', '.']

Tokens After Special Character Removal: ['Weather', 'forecasts', 'improved', 'dramatically', 'due', 'AIpoweredmodels']

Stemmed Tokens: ['weather', 'forecast', 'improv', 'dramat', 'due', 'aipoweredmodel']


LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger_eng not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger_eng')

  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger_eng/

  Searched in:
    - 'C:\\Users\\Birendra/nltk_data'
    - 'C:\\Users\\Birendra\\anaconda3\\envs\\cleanenv\\nltk_data'
    - 'C:\\Users\\Birendra\\anaconda3\\envs\\cleanenv\\share\\nltk_data'
    - 'C:\\Users\\Birendra\\anaconda3\\envs\\cleanenv\\lib\\nltk_data'
    - 'C:\\Users\\Birendra\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************
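The LookupError above names the missing resource: this NLTK installation looks for `averaged_perceptron_tagger_eng`, while the first cell downloaded only `averaged_perceptron_tagger`. A minimal sketch of a fix is to request both spellings so the notebook works on older and newer releases alike. The helper name `ensure_resources` is hypothetical, and apart from the `_eng` tagger the resource list is an assumption about what a given NLTK version needs (`punkt_tab` in particular is assumed for recent releases).

```python
import nltk

# Only 'averaged_perceptron_tagger_eng' is confirmed by the traceback;
# the other names are assumptions covering older and newer NLTK releases.
RESOURCES = [
    "punkt",
    "punkt_tab",
    "stopwords",
    "wordnet",
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",
]

def ensure_resources(names, downloader=nltk.download):
    """Try to download each resource; return a name -> success flag mapping."""
    return {name: bool(downloader(name, quiet=True)) for name in names}

# Example call (needs network access the first time it runs):
# status = ensure_resources(RESOURCES)
# print(status)
```

After the downloads succeed, re-running the lemmatization cell should get past the `pos_tag` call. Any name reported as `False` was not available for the installed version and can be dropped from the list.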
