Q1. Write a unique paragraph (5-6 sentences) about your favorite topic (e.g., sports,
technology, food, books, etc.).
1. Convert text to lowercase and remove punctuation.
2. Tokenize the text into words and sentences.
3. Remove stopwords (using NLTK's stopwords list).
4. Display word frequency distribution (excluding stopwords).

In [8]:
import nltk
nltk.download('punkt')       # For tokenization
nltk.download('stopwords')   # For stopword removal
nltk.download('wordnet')     # For lemmatization
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Swast\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Swast\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Swast\AppData\Roaming\nltk_data...
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Swast\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True


import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from collections import Counter
import string

text = """
In recent years, artificial intelligence has revolutionized the way we interact with technology.
From voice assistants responding to casual conversation to self‑driving cars navigating busy streets, the impact is profound.
Developers are racing to build smarter algorithms that can learn and adapt in real time.
Meanwhile, concerns around data privacy and ethical use of AI continue to grow.
It is clear that the next decade will be shaped by breakthroughs in machine learning and neural network research.
"""

# 1. Lowercase & remove punctuation
clean = text.lower().translate(str.maketrans('', '', string.punctuation))

# 2. Tokenize: sentences come from the original text, because the cleaned
#    text has no sentence-ending punctuation left to split on
sentences = sent_tokenize(text.strip())
words = word_tokenize(clean)

# 3. Remove stopwords (and any non-alphabetic tokens)
stops = set(stopwords.words('english'))
filtered = [w for w in words if w not in stops and w.isalpha()]

# 4. Word frequency distribution
freq = Counter(filtered)
print("Sentence tokens:", sentences)
print("Word tokens:", words)
print("Filtered words:", filtered)
print("Word Frequency Distribution:")
for word, count in freq.most_common():
    print(f"{word}: {count}")
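
One caveat in the cleaning step: `string.punctuation` only covers ASCII punctuation. The hyphen in "self‑driving" is a non-ASCII hyphen, so it survives the `translate` call and the whole token is then discarded by the `isalpha()` check. A minimal standard-library sketch of one way around this, assuming it is acceptable to replace every Unicode punctuation character with a space; `strip_unicode_punct` is a hypothetical helper name, not part of NLTK, and the sample string is made up:

```python
import unicodedata
from collections import Counter

def strip_unicode_punct(s):
    """Lowercase s and replace every Unicode punctuation character with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in s.lower()
    )

# "\u2011" is the non-breaking hyphen; string.punctuation does not contain it.
sample = "Self\u2011driving cars, voice assistants, and cars again!"
tokens = strip_unicode_punct(sample).split()
print(tokens)
print(Counter(tokens).most_common(1))
```

Replacing punctuation with a space (rather than deleting it) keeps "self" and "driving" as separate words instead of fusing them into one token.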

Q2: Stemming and Lemmatization
1. Take the tokenized words from Question 1 (after stopword removal).
2. Apply stemming using NLTK's PorterStemmer and LancasterStemmer.
3. Apply lemmatization using NLTK's WordNetLemmatizer.
4. Compare and display results of both techniques.

In [21]:
# Q2: stemming & lemmatization
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

# If running first time, you may need:
# nltk.download('wordnet')

porter = PorterStemmer()
lancaster = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

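# Run each filtered word through both stemmers and the lemmatizer.
# lemmatize() treats every word as a noun by default (pos='n'),
# so verb forms such as 'responding' come back unchanged.
results = []
for w in filtered:
    results.append({
        'original': w,
        'porter': porter.stem(w),
        'lancaster': lancaster.stem(w),
        'lemma': lemmatizer.lemmatize(w)
    })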

# Display in a neat table
import pandas as pd
df2 = pd.DataFrame(results)
print(df2)


          original        porter     lancaster           lemma
0           recent        recent           rec          recent
1            years          year          year            year
2       artificial      artifici           art      artificial
3     intelligence      intellig      intellig    intelligence
4   revolutionized    revolution        revolv  revolutionized
5              way           way           way             way
6         interact      interact      interact        interact
7       technology     technolog     technolog      technology
8            voice          voic          voic           voice
9       assistants        assist        assist       assistant
10      responding       respond       respond      responding
11          casual        casual           cas          casual
12    conversation       convers       convers    conversation
13            cars           car           car             car
14      navigating         navig         navig      nav

Q3. Regular Expressions and Text Splitting
1. Take the original text from Question 1.
2. Use regular expressions to:
a. Extract all words with more than 5 letters.
b. Extract all numbers (if any exist in the text).
c. Extract all capitalized words.
3. Use text splitting techniques to:
a. Split the text into words containing only alphabets (removing digits and special
characters).
b. Extract words starting with a vowel.


In [24]:
# Q3: regex & splitting
import re

orig = text  # use the original (with punctuation & case)

# 2a. Words with more than 5 letters
long_words = re.findall(r'\b\w{6,}\b', orig)
# 2b. Numbers (integers or decimals)
numbers = re.findall(r'\d+(?:\.\d+)?', orig)
# 2c. Capitalized words (an uppercase letter followed by lowercase letters,
#     so all-caps words such as "AI" are not matched)
capitalized = re.findall(r'\b[A-Z][a-z]+\b', orig)

# 3a. Split into alphabet-only words
alpha_words = re.findall(r'\b[A-Za-z]+\b', orig)
# 3b. Words starting with a vowel
vowel_words = [w for w in alpha_words if re.match(r'^[AEIOUaeiou]', w)]

print("Words >5 letters:", long_words)
print("Numbers:", numbers)
print("Capitalized words:", capitalized)
print("Alpha-only words:", alpha_words)
print("Words starting with vowel:", vowel_words)

Words >5 letters: ['recent', 'artificial', 'intelligence', 'revolutionized', 'interact', 'technology', 'assistants', 'responding', 'casual', 'conversation', 'driving', 'navigating', 'streets', 'impact', 'profound', 'Developers', 'racing', 'smarter', 'algorithms', 'Meanwhile', 'concerns', 'around', 'privacy', 'ethical', 'continue', 'decade', 'shaped', 'breakthroughs', 'machine', 'learning', 'neural', 'network', 'research']
Numbers: []
Capitalized words: ['In', 'From', 'Developers', 'Meanwhile', 'It']
Alpha-only words: ['In', 'recent', 'years', 'artificial', 'intelligence', 'has', 'revolutionized', 'the', 'way', 'we', 'interact', 'with', 'technology', 'From', 'voice', 'assistants', 'responding', 'to', 'casual', 'conversation', 'to', 'self', 'driving', 'cars', 'navigating', 'busy', 'streets', 'the', 'impact', 'is', 'profound', 'Developers', 'are', 'racing', 'to', 'build', 'smarter', 'algorithms', 'that', 'can', 'learn', 'and', 'adapt', 'in', 'real', 'time', 'Meanwhile', 'concerns', 'aroun
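
Two details of the patterns above are worth noting. `[A-Z][a-z]+` requires at least one lowercase letter after the capital, so an all-caps word such as "AI" is not reported as capitalized. The "splitting" step is also done with `re.findall` rather than an actual split. A small sketch of the alternatives, assuming all-caps words should count as capitalized and using a short made-up sample string instead of the full paragraph:

```python
import re

sample = "Meanwhile, ethical use of AI in 2024 continues to grow."

# Capitalized words, including all-caps ones: an uppercase letter followed by any letters
capitalized = re.findall(r"\b[A-Z][A-Za-z]*\b", sample)

# True splitting: break on every run of non-letters and drop empty strings
alpha_words = [w for w in re.split(r"[^A-Za-z]+", sample) if w]

# Words starting with a vowel, checked without a regex
vowel_words = [w for w in alpha_words if w[0].lower() in "aeiou"]

print(capitalized)
print(alpha_words)
print(vowel_words)
```

The `if w` filter matters because `re.split` returns an empty string when the text starts or ends with a separator, as it does here with the final period.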

Q4. Custom Tokenization & Regex-based Text Cleaning
1. Take original text from Question 1.
2. Write a custom tokenization function that:
a. Removes punctuation and special symbols, but keeps contractions (e.g.,
"isn't" should not be split into "is" and "n't").
b. Handles hyphenated words as a single token (e.g., "state-of-the-art" remains
a single token).
c. Tokenizes numbers separately but keeps decimal numbers intact (e.g., "3.14"
should remain as is).
3. Use Regex Substitutions (re.sub) to:
a. Replace email addresses with '<EMAIL>' placeholder.
b. Replace URLs with '<URL>' placeholder.
c. Replace phone numbers (formats: 123-456-7890 or +91 9876543210) with
'<PHONE>' placeholder.

In [27]:
# Q4: custom tokenizer & cleaning
import re

def custom_tokenize(s):
    # Replace emails, URLs, and phone numbers with placeholders before tokenizing
    s = re.sub(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+', '<EMAIL>', s)
    s = re.sub(r'https?://\S+|www\.\S+', '<URL>', s)
    # Phone formats: 123-456-7890 or +91 9876543210
    s = re.sub(r'\+\d{1,3}[\s-]?\d{10}\b|\b\d{3}-\d{3}-\d{4}\b', '<PHONE>', s)
    # Token alternatives, tried left to right:
    #   placeholders | numbers (decimals kept intact) | words that may
    #   contain internal apostrophes or hyphens (isn't, state-of-the-art)
    return re.findall(
        r"<(?:EMAIL|URL|PHONE)>|\d+(?:\.\d+)?|[A-Za-z]+(?:['-][A-Za-z]+)*", s
    )

sample = text.strip()
tokens_q4 = custom_tokenize(sample)
print("Custom tokens:", tokens_q4)

Custom tokens: ['In', 'recent', 'years', 'artificial', 'intelligence', 'has', 'revolutionized', 'the', 'way', 'we', 'interact', 'with', 'technology', 'From', 'voice', 'assistants', 'responding', 'to', 'casual', 'conversation', 'to', 'self', 'driving', 'cars', 'navigating', 'busy', 'streets', 'the', 'impact', 'is', 'profound', 'Developers', 'are', 'racing', 'to', 'build', 'smarter', 'algorithms', 'that', 'can', 'learn', 'and', 'adapt', 'in', 'real', 'time', 'Meanwhile', 'concerns', 'around', 'data', 'privacy', 'and', 'ethical', 'use', 'of', 'AI', 'continue', 'to', 'grow', 'It', 'is', 'clear', 'that', 'the', 'next', 'decade', 'will', 'be', 'shaped', 'by', 'breakthroughs', 'in', 'machine', 'learning', 'and', 'neural', 'network', 'research']
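
The sample paragraph contains no email addresses, URLs, phone numbers, contractions, or decimals, so the run above never exercises those rules. The sketch below checks them on a short made-up string. The address, URL, and phone numbers are invented for illustration, and `mask_contacts` and `TOKEN_RE` are hypothetical names rather than the notebook's own function:

```python
import re

def mask_contacts(s):
    """Replace emails, URLs, and two phone formats with placeholders."""
    s = re.sub(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL>", s)
    s = re.sub(r"https?://\S+|www\.\S+", "<URL>", s)
    # Assumed phone formats: 123-456-7890 or +91 9876543210
    s = re.sub(r"\+\d{1,3}[\s-]?\d{10}\b|\b\d{3}-\d{3}-\d{4}\b", "<PHONE>", s)
    return s

# Alternatives are tried left to right: placeholders, numbers (decimals intact),
# then words that may contain internal apostrophes or hyphens.
TOKEN_RE = re.compile(
    r"<(?:EMAIL|URL|PHONE)>|\d+(?:\.\d+)?|[A-Za-z]+(?:['-][A-Za-z]+)*"
)

sample = ("Pi isn't 3.14 exactly; mail ana@example.com, see https://example.com "
          "or call 123-456-7890 / +91 9876543210 about state-of-the-art models.")
masked = mask_contacts(sample)
tokens = TOKEN_RE.findall(masked)
print(masked)
print(tokens)
```

Masking runs before tokenization so that the dots and hyphens inside addresses and phone numbers are never seen by the number and word patterns.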
