In [4]:
text = "NLP Is Amazing! But CLEAN Data Is More Important."

In [2]:
lower_text = text.lower()

In [3]:
print(lower_text)

nlp is amazing! but clean data is more important.


## Exercise 2: Removing Punctuation

In [6]:
import string

In [7]:
text = "Hello, world! Welcome to NLP: Natural Language Processing."
clean_text = text.translate(str.maketrans('', '', string.punctuation))


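str.maketrans('', '', string.punctuation) → builds a translation table in which every character of string.punctuation is mapped to None; text.translate(...) then deletes those characters.

We’re not splitting or joining words yet; we only remove punctuation.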

In [8]:
clean_text

'Hello world Welcome to NLP Natural Language Processing'
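
The translation table can also be built from a reduced character set when some punctuation should survive. The snippet below is a minimal sketch, assuming we want to keep apostrophes and hyphens; the sample sentence and the variable names are made up for illustration.

```python
import string

text = "Don't stop, it's NLP: Natural-Language Processing!"

# Assumption: apostrophes and hyphens carry meaning here, so we keep them
# by dropping them from the set of characters to delete.
to_remove = string.punctuation.replace("'", "").replace("-", "")
table = str.maketrans("", "", to_remove)

# Each entry maps a code point to None, which translate() treats as "delete".
print(table[ord("!")])

cleaned = text.translate(table)
print(cleaned)
```

Building the table once and reusing it is usually cheaper than rebuilding it for every string.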

## Exercise 3: Stopword Removal

Goal: Remove common but less informative words like “is”, “the”, “a”, “and” which don’t add much meaning in some NLP tasks.

Example Input: 
text = "NLP is the process of teaching machines to understand human language."

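Expected Output (the code below lowercases the text and returns tokens, including the final "."):
['nlp', 'process', 'teaching', 'machines', 'understand', 'human', 'language', '.']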


In [9]:
import nltk
from nltk.corpus import stopwords

In [10]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
stop_words = set(stopwords.words('english'))

In [12]:
from nltk.tokenize import word_tokenize

In [13]:
text = "NLP is the process of teaching machines to understand human language."

In [17]:
text = text.lower()
words = word_tokenize(text)
words

['nlp',
 'is',
 'the',
 'process',
 'of',
 'teaching',
 'machines',
 'to',
 'understand',
 'human',
 'language',
 '.']

In [18]:
filtered_word = [word for word in words if word not in stop_words]
filtered_word

['nlp',
 'process',
 'teaching',
 'machines',
 'understand',
 'human',
 'language',
 '.']

1) Stopword lists vary: NLTK, spaCy, and scikit-learn each ship a different set.

2) Removing stopwords can sometimes hurt performance (e.g., “not good” → removing “not” changes meaning).

3) Use task-dependent judgment — for sentiment analysis, negations like “not” are critical.
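
The point about negations can be sketched without any library. The stopword set below is a small hand-written one, not NLTK's list, and `remove_stopwords` and `keep_negations` are hypothetical names chosen for this illustration.

```python
# Hypothetical, hand-written stopword set for illustration only (not NLTK's list).
BASIC_STOPWORDS = {"is", "the", "a", "an", "and", "of", "to", "this", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, keep_negations=False):
    """Drop stopwords; optionally keep negation words for sentiment tasks."""
    stop = BASIC_STOPWORDS - NEGATIONS if keep_negations else BASIC_STOPWORDS
    return [t for t in tokens if t.lower() not in stop]

tokens = "this movie is not good".split()
print(remove_stopwords(tokens))                       # ['movie', 'good']
print(remove_stopwords(tokens, keep_negations=True))  # ['movie', 'not', 'good']
```

With the default setting the review reads as positive; keeping "not" preserves the original sentiment.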

## Exercise 4: Stemming

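Goal: Reduce words to a crude root form by stripping suffixes, without caring about grammar or dictionary correctness.
This helps group related word forms: "running", "runs" → "run". Irregular forms such as "ran" are left unchanged, because a stemmer only applies suffix rules.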

#### Example Input

text = "I was running faster than anyone who runs daily."


#### Expected Output

"i wa run faster than anyon who run daili ."
(PorterStemmer lowercases its output by default, and the final "." stays as its own token.)


In [19]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [20]:
ps = PorterStemmer()
text = "I was running faster than anyone who runs daily."

In [21]:
words = word_tokenize(text)
words

['I',
 'was',
 'running',
 'faster',
 'than',
 'anyone',
 'who',
 'runs',
 'daily',
 '.']

In [22]:
stemmed_words = [ps.stem(word) for word in words]
stemmed_words

['i', 'wa', 'run', 'faster', 'than', 'anyon', 'who', 'run', 'daili', '.']

In [23]:
print(" ".join(stemmed_words))

i wa run faster than anyon who run daili .


##### Key Points:

PorterStemmer → simple, rule-based, works fast but can distort words.

LancasterStemmer → more aggressive, may cut too much.

SnowballStemmer → balanced, supports multiple languages.
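
To see how the three stemmers differ in practice, they can be run side by side on the same words. This is a minimal sketch that assumes NLTK is installed; the stemmers themselves need no downloaded data. The word list is taken from the exercise sentence, and the `stemmers` and `comparison` names are only for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}

words = ["running", "runs", "daily", "anyone"]  # sample words from the exercise

# Stem every word with every stemmer and print the results side by side.
comparison = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}
for name, stems in comparison.items():
    print(f"{name:<10} {stems}")
```

Comparing the rows on your own vocabulary is a quick way to decide which stemmer is aggressive enough for the task.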

## Exercise 5: Lemmatization

##### Goal:
 Reduce words to their dictionary form (lemma) while considering grammar and meaning.
E.g., "better" → "good", "running" → "run".

###### Example Input:
text = "The children are running faster than the men."


##### Expected Output:
"The child be run fast than the man"
(Grammar stripped, but meaning preserved better than stemming. This is the ideal result; the NLTK run below leaves "faster" and "men" unchanged.)

In [3]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [10]:
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\SkyTech\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [5]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [11]:
lemmatizer = WordNetLemmatizer()
text = "The children are running faster than the men."
# text = text.lower()
words = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(words)

In [12]:
lemmatized_words = [
    lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    for word, tag in pos_tags
]

In [13]:
lemmatized_words


['The', 'child', 'be', 'run', 'faster', 'than', 'the', 'men', '.']

In [14]:
print(" ".join(lemmatized_words))

The child be run faster than the men .


In [19]:
lemmatizer.lemmatize("are", wordnet.VERB)

'be'

In [17]:
lemmatizer.lemmatize("running", wordnet.VERB)

'run'

### Full Code:

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name used by newer NLTK releases
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

text = "The children are running faster than the men."
words = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(words)  # [('The', 'DT'), ('children', 'NNS'), ...]

lemmatized_words = [
    lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    for word, tag in pos_tags
]

print(lemmatized_words)


### Exercise 6: Whitespace Normalization

In [23]:
import re

text = "This    is   NLP.   \nIt   has   extra   spaces."
normalized_text = re.sub(r'\s+', ' ', text).strip()
normalized_text

'This is NLP. It has extra spaces.'

### Exercise 7: Removing Special Characters
###### Goal: Remove symbols, emojis, and non-alphanumeric characters while keeping words and numbers intact.

###### Expected Input:
text = "NLP ❤️ is amazing!!! $$$ #AI @ML"
###### Expected Output:
"NLP is amazing AI ML" (once the doubled spaces left by the removal are collapsed, as in Exercise 6)


In [24]:
import re

text = "NLP ❤️ is amazing!!! $$$ #AI @ML"

clean_text = re.sub(r'[^A-Za-z0-9\s]', '', text)
clean_text

'NLP  is amazing  AI ML'
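
The output above still contains doubled spaces where symbols were removed. A minimal sketch that chains the character filter with the whitespace normalization from Exercise 6 is shown below; `remove_special_characters` is a hypothetical helper name.

```python
import re

def remove_special_characters(text):
    """Drop everything except ASCII letters, digits and whitespace, then tidy spaces."""
    no_symbols = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse the gaps left behind by the removed symbols.
    return re.sub(r"\s+", " ", no_symbols).strip()

result = remove_special_characters("NLP ❤️ is amazing!!! $$$ #AI @ML")
print(result)  # NLP is amazing AI ML
```

Note that the character class is ASCII-only, so accented letters and non-Latin scripts would be removed as well.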

### Exercise 8: Removing Numbers

In [25]:
import re

text = "In 2025, NLP models will have 500 billion parameters."
clear_text = re.sub(r'\d+', '', text)
clear_text

'In , NLP models will have  billion parameters.'

##### Replace numbers with a placeholder instead of removing them

In [26]:
placeholder_text = re.sub(r'\d+', '<NUM>', text)
placeholder_text

'In <NUM>, NLP models will have <NUM> billion parameters.'
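
The `\d+` pattern treats a formatted number such as 1,299.99 as three separate numbers. The sketch below assumes English-style formatting with "," and "." as separators; `NUMBER` and `mask_numbers` are hypothetical names.

```python
import re

# Assumption: digits may be grouped with "," or "." (e.g. 1,299.99).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(text, placeholder="<NUM>"):
    """Replace each whole number, including its separators, with one placeholder."""
    return NUMBER.sub(placeholder, text)

masked = mask_numbers("The model costs 1,299.99 dollars and has 500 layers.")
print(masked)  # The model costs <NUM> dollars and has <NUM> layers.
```

A separator only counts as part of the number when a digit follows it, so the comma in "In 2025, NLP" is left in place.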

### Exercise 9: Removing URLs & Email Addresses

##### Expected Input:
text = "Visit our site at https://example.com or contact me at test@example.org."
##### Expected Output:
"Visit our site at  or contact me at "
(The email pattern used below also consumes the sentence-final period.)


In [29]:
import re

text = "Visit our site at https://example.com or contact me at test@example.org."

# remove urls

no_urls = re.sub(r'http\S+|www\.\S+', '', text)
print(no_urls)

# Remove Emails
no_emails = re.sub(r'\S+@\S+', '', no_urls)
print(no_emails)


Visit our site at  or contact me at test@example.org.
Visit our site at  or contact me at 


http\S+ → matches http or https followed by any non-space characters.

www\.\S+ → matches www. followed by any non-space characters.

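\S+@\S+ → matches any run of non-space characters that contains an @ (something@something), so it also swallows trailing punctuation such as the final period.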
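
If trailing punctuation should survive, the patterns can be tightened. The snippet below is a sketch under simplifying assumptions: `URL_RE` and `EMAIL_RE` are illustrative patterns, not complete URL or email validators, and `strip_urls_and_emails` is a hypothetical helper.

```python
import re

# Assumed patterns for illustration; real-world URLs and emails are more varied.
# The URL must end on a character that is not sentence punctuation.
URL_RE = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?]")
# The domain part requires at least one ".label", so a final "." is not consumed.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def strip_urls_and_emails(text):
    without_urls = URL_RE.sub("", text)
    without_emails = EMAIL_RE.sub("", without_urls)
    # Collapse the gaps left by the removed items.
    return re.sub(r"\s+", " ", without_emails).strip()

text = "Visit our site at https://example.com or contact me at test@example.org."
stripped = strip_urls_and_emails(text)
print(stripped)  # Visit our site at or contact me at .
```

The final period is kept because the email pattern stops at the last domain label.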

### Exercise 10: Handling Hashtags & Mentions

Goal: Remove or clean hashtags (#topic) and mentions (@username) from text.

##### Expected Input:
text = "Loving the #NLP journey with @OpenAI team!"
##### Expected Output:
"Loving the journey with team!"


In [32]:
import re

text = "Loving the #NLP journey with @OpenAI team!"

# Remove hashtags together with the tagged word
no_hashtags = re.sub(r'#\w+', '', text)
print(no_hashtags)

# Keep the tagged word, remove only the '#' symbol
keep_tags = re.sub(r'#', '', text)
print(keep_tags)

# Remove mentions together with the username
clean_text = re.sub(r'@\w+', '', no_hashtags)
print(clean_text)

# Keep the username, remove only the '@' symbol
clean_text1 = re.sub(r'@', '', keep_tags)
print(clean_text1)

Loving the  journey with @OpenAI team!
Loving the NLP journey with @OpenAI team!
Loving the  journey with  team!
Loving the NLP journey with OpenAI team!


### Exercise 11: Expanding Contractions
Goal: Replace contracted forms like "can't" → "cannot", "I'm" → "I am" for clearer text representation.

##### Expected Input: 
text = "I'm learning NLP and I can't stop now."
##### Expected Output:
"i am learning nlp and i cannot stop now." (the function below lowercases the text before matching)


In [33]:
import re

In [34]:
contractions_dict = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "he's": "he is",
    "she's": "she is",
    "they're": "they are",
    "we're": "we are",
    "that's": "that is",
    "what's": "what is",
    "let's": "let us",
    "n't": " not",
    "'re": " are",
    "'s": " is",
    "'d": " would",
    "'ll": " will",
    "'ve": " have",
    "'m": " am"
}

In [None]:
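def expand_contraction(text):
    # Build one alternation pattern from all dictionary keys.
    # Keys are tried in dictionary order, so "can't" wins over the generic "n't".
    pattern = re.compile('(%s)' % '|'.join(re.escape(key) for key in contractions_dict))
    def replace(match):
        # match.group(0) returns the exact contraction string that was matched.
        return contractions_dict[match.group(0)]
    # Lowercase first, because the dictionary keys are all lowercase.
    return pattern.sub(replace, text.lower())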

###### match → a match object from regex that contains the contraction found.
###### match.group(0) → returns the exact contraction string that was matched in the text.
###### contractions_dict[...] → looks up that contraction in our dictionary and returns its expanded form.


###### match.group(0)  # "can't"
###### contractions_dict["can't"]  # "cannot"


#### Example:
###### text = "I'm happy because I can't lose."
###### lowercase = "i'm happy because i can't lose."
###### pattern finds: "i'm" → replace → "i am"
###### pattern finds: "can't" → replace → "cannot"
###### Final: "i am happy because i cannot lose."


In [36]:
text = "I'm learning NLP and I can't stop now."
expand_text = expand_contraction(text)

print(expand_text)

i am learning nlp and i cannot stop now.
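
The function above lowercases the whole sentence, so "NLP" becomes "nlp". A case-preserving variant is sketched below, assuming a shortened dictionary; `CONTRACTIONS` and `expand_preserving_case` are hypothetical names, and the capitalization rule is a simple heuristic.

```python
import re

# Shortened, illustrative dictionary; keys must be lowercase.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "n't": " not",
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
}

# Longest keys first so "can't" is tried before the generic "n't" suffix.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand_preserving_case(text):
    def replace(match):
        found = match.group(0)
        expanded = CONTRACTIONS[found.lower()]
        # Re-capitalize when the contraction started with an uppercase letter.
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded
    return _PATTERN.sub(replace, text)

expanded_text = expand_preserving_case("I'm learning NLP and I can't stop now.")
print(expanded_text)  # I am learning NLP and I cannot stop now.
```

Compiling the pattern once at module level also avoids rebuilding it on every call.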


##### What expand_contraction does:
###### 1. Make a “search list”
It takes all the keys from your dictionary, like:  
        ["can't", "won't", "i'm", "it's", "he's", ...]  
and joins them with OR (|) in a regex:  
        (can't|won't|i'm|it's|he's|...)  ← each key goes through re.escape, and the apostrophes are kept  
So now we have a single pattern that can find any contraction in your text.  
###### 2. Decide “What to Replace With”  
For every match found, it looks inside your dictionary and says:  

    “Oh, you found can't? Replace it with cannot.”  

    “Oh, you found i'm? Replace it with i am.”

###### 3. Do the Swap  
It goes through your sentence (after turning it to lowercase):  
    "i'm learning nlp and i can't stop now."  
and replaces each contraction using your dictionary:  

    "i am learning nlp and i cannot stop now."


