## DATA 622 Natural Language Processing
### Homework 2

Questions
Proceed as indicated:
1. Read the File
Read the content of https://www.cnn.com/2025/06/13/style/why-luxury-brands-are-soexpensive and print the first 700 characters.
2. Remove HTML Tags
If any HTML tags are present in the file, remove them so that only the raw text remains.
3. Lowercase and Remove Punctuation
Convert all text to lowercase and remove all punctuation characters.
4. Remove Stopwords
Remove English stopwords from the text. (Use NLTK’s list of stopwords.)
5. Lemmatize the Text
Lemmatize all remaining words (use NLTK’s WordNetLemmatizer) and print the first 50 lemmatized words. Is there any difference in your output if you stemmed the text?

In [1]:
import requests
from bs4 import BeautifulSoup
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Download NLTK resources if not already present
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [3]:
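# Step 1: Read the File (given article)
url = "https://www.cnn.com/2025/06/13/style/why-luxury-brands-are-soexpensive"
# A timeout keeps a stalled connection from hanging the cell
response = requests.get(url, timeout=30)
# Note: the HTTP status code is not checked, so an error page is processed like any other page
html_content = response.text
print("First 700 characters of raw file:\n")
print(html_content[:700])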

First 700 characters of raw file:

<!DOCTYPE html><html class="no-js"><head><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"><meta charset="utf-8"><meta content="text/html" http-equiv="Content-Type"><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0"><link rel="dns-prefetch" href="/optimizelyjs/131788053.js" /><link rel="dns-prefetch" href="//tpc.googlesyndication.com" /><link rel="dns-prefetch" href="//pagead2.googlesyndication.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="//partner.googleadservices.com" /><link rel="dns-prefetch" href="//www.google.com" /><link rel="dns-prefetch" href="//aax.amazon-adsystem.com" /><link r


In [4]:
# Step 2: Remove HTML Tags
soup = BeautifulSoup(html_content, "html.parser")
# Note: get_text() with no separator joins the text of adjacent elements without any whitespace
text = soup.get_text()
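
Many of the tokens printed later in this notebook are run together (for example 'justiceenergy' and 'footballcollege'), which is consistent with `get_text()` joining the text of adjacent elements with nothing between them. A minimal sketch of one way to avoid that, assuming the goal is visible page text only: remove `<script>` and `<style>` elements, then pass a separator to `get_text`. The helper name `html_to_text` and the sample HTML are hypothetical and not part of the assignment.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Return visible text with a space between adjacent elements."""
    soup = BeautifulSoup(html, "html.parser")
    # Script and style bodies are not visible text, so drop them first
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " puts a space between text nodes; strip=True trims each one
    return soup.get_text(separator=" ", strip=True)

# Hypothetical navigation markup, used only to illustrate the difference
sample = "<ul><li>Crime + Justice</li><li>Energy + Environment</li></ul><script>var x = 1;</script>"
print(html_to_text(sample))
```

One trade-off: a separator also splits text at inline tags, so a word that is partly wrapped in a tag such as `<b>` would come out as two pieces.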

In [5]:
# Step 3: Lowercase and remove punctuation
text = text.lower()
text = text.translate(str.maketrans('', '', string.punctuation))
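
Deleting punctuation outright can fuse words that were separated only by punctuation, which is one way a token like 'uhohit' can arise. A hedged alternative sketch maps each punctuation character to a space and then collapses repeated whitespace. The function name `normalize` and the sample string are hypothetical. Note that `string.punctuation` covers ASCII characters only, so typographic quotes such as ’ are left in place by either approach.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text):
    """Lowercase, replace ASCII punctuation with spaces, collapse whitespace."""
    lowered = text.lower()
    spaced = lowered.translate(_PUNCT_TO_SPACE)
    return " ".join(spaced.split())

print(normalize("Uh-oh!It could be you."))
```

The cost of this choice is that contractions are split as well: "it's" becomes the two tokens "it" and "s", where deletion would give "its".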

In [6]:
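# Step 4: Remove Stopwords
# word_tokenize in recent NLTK releases loads the "punkt_tab" resource
nltk.download("punkt_tab")

tokens = nltk.word_tokenize(text)
stop_words = set(stopwords.words('english'))
# Keep purely alphabetic tokens that are not English stopwords
filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]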


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [7]:
# Step 5: Lemmatize
lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos="n", so every token is lemmatized as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("\nFirst 50 lemmatized words:\n")
print(lemmatized_words[:50])


First 50 lemmatized words:

['error', 'cnnopen', 'justiceenergy', 'environmentextreme', 'weatherspace', 'scienceworldafricaamericasasiaaustraliachinaeuropeindiamiddle', 'eastunited', 'electionfacts', 'firstelection', 'opedssocial', 'commentaryhealthfoodfitnesswellnessparentingvital', 'signsentertainmentstarsscreenbingeculturemediatechinnovategadgetforeseeable', 'futuremission', 'aheadupstartswork', 'transformedinnovative', 'citiesstyleartsdesignfashionarchitectureluxurybeautyvideotraveldestinationsfood', 'drinkstaynewsvideossportspro', 'footballcollege', 'footballbasketballbaseballsoccerolympicsvideoslive', 'tv', 'digital', 'studioscnn', 'filmshlntv', 'scheduletv', 'show', 'azcnnvrcouponscnn', 'underscoredexplorewellnessgadgetslifestylecnn', 'storemorephotoslongforminvestigationscnn', 'profilescnn', 'leadershipcnn', 'newsletterswork', 'cnnweatherclimatestorm', 'trackerwildfire', 'trackervideofollow', 'cnn', 'uhohit', 'could', 'could', 'u', 'there', 'page', 'searchuscrime', 'justiceene
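
The tokens above begin with 'error' and include 'uhohit', 'could', and 'page', which reads more like site navigation and an error notice than article text, so the request may not have returned the intended article. A minimal sketch of a guard for Step 1, assuming `requests` as used in this notebook. The name `fetch_html` and its `get` parameter (injected so the function can be exercised without network access) are hypothetical.

```python
import requests

def fetch_html(url, timeout=30, get=requests.get):
    """Fetch a page and fail loudly on an HTTP error status."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With a guard like this, a failed request would stop the notebook at Step 1 instead of passing error-page text through the rest of the pipeline.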

In [8]:
# Stem the same tokens with the Porter stemmer to compare against the lemmatized output
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_tokens]

print("\nFirst 50 stemmed words:\n")
print(stemmed_words[:50])



First 50 stemmed words:

['error', 'cnnopen', 'justiceenergi', 'environmentextrem', 'weatherspac', 'scienceworldafricaamericasasiaaustraliachinaeuropeindiamiddl', 'eastunit', 'electionfact', 'firstelect', 'opedssoci', 'commentaryhealthfoodfitnesswellnessparentingvit', 'signsentertainmentstarsscreenbingeculturemediatechinnovategadgetforese', 'futuremiss', 'aheadupstartswork', 'transformedinnov', 'citiesstyleartsdesignfashionarchitectureluxurybeautyvideotraveldestinationsfood', 'drinkstaynewsvideossportspro', 'footballcolleg', 'footballbasketballbaseballsoccerolympicsvideosl', 'tv', 'digit', 'studioscnn', 'filmshlntv', 'scheduletv', 'show', 'azcnnvrcouponscnn', 'underscoredexplorewellnessgadgetslifestylecnn', 'storemorephotoslongforminvestigationscnn', 'profilescnn', 'leadershipcnn', 'newsletterswork', 'cnnweatherclimatestorm', 'trackerwildfir', 'trackervideofollow', 'cnn', 'uhohit', 'could', 'could', 'us', 'there', 'page', 'searchuscrim', 'justiceenergi', 'environmentextrem', 'weathers
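
To answer the stemming question concretely, the two outputs can be compared position by position. The sketch below does this on a selection of aligned tokens copied from the two printed lists above. In the notebook the same comparison could be run on `lemmatized_words` and `stemmed_words` directly. The helper name `diff_pairs` is hypothetical.

```python
def diff_pairs(lemmas, stems):
    """Return (lemma, stem) pairs at positions where the two lists disagree."""
    return [(lemma, stem) for lemma, stem in zip(lemmas, stems) if lemma != stem]

# Aligned tokens copied from the two outputs above (a selection, not the full lists)
lemmas = ['error', 'cnnopen', 'justiceenergy', 'environmentextreme', 'weatherspace',
          'tv', 'digital', 'show', 'could', 'u', 'there', 'page']
stems = ['error', 'cnnopen', 'justiceenergi', 'environmentextrem', 'weatherspac',
         'tv', 'digit', 'show', 'could', 'us', 'there', 'page']

for lemma, stem in diff_pairs(lemmas, stems):
    print(f"{lemma} -> {stem}")
```

On this sample the stemmer clips suffixes whether or not the result is a word ('digital' becomes 'digit', a trailing 'y' becomes 'i', a trailing 'e' is dropped), while the lemmatizer leaves most tokens unchanged. The one token the lemmatizer alters more than the stemmer is 'us', which it reduces to 'u'.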