## DATA 622 Natural Language Processing
### Homework 8

Questions

Use these two articles:
url1 = "https://www.nytimes.com/2024/05/14/climate/climate-change-extremeweather.html"
url2 = "https://www.foxnews.com/science/climate-change-weather-impact-explained"
1. Based on AI/ML methods, measure the similarity between the two articles.
2. Do they share
a. Similar topics,
b. Sentiment, and
c. Emotions?
3. Identify the top five keywords in each article.
4. Summarize your findings using LLMs.

In [34]:
# Install only libs that don't need external downloads
%pip install -qU "requests==2.32.4" beautifulsoup4 scikit-learn vaderSentiment

In [35]:
import re, textwrap, os, numpy as np, requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

URL_A = "https://www.nasa.gov/climate-change/"
URL_B = "https://www.foxnews.com/science"

def fetch(url: str) -> str:
    r = requests.get(url, timeout=30); r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    for bad in soup(["script","style","noscript"]): bad.decompose()
    text = " ".join(soup.get_text(" ").split())
    return re.sub(r"\s+"," ", text).strip()

def tfidf_cosine_and_keywords(a: str, b: str, k=5):
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=8000)
    X = vec.fit_transform([a, b])
    cos = float(cosine_similarity(X[0:1], X[1:2])[0][0])
    idx2term = {i:t for t,i in vec.vocabulary_.items()}
    def topk(row):
        arr=row.toarray().ravel(); ii=np.argsort(arr)[::-1]
        return [idx2term[i] for i in ii if arr[i]>0][:k]
    return cos, topk(X[0:1]), topk(X[1:2])

def summarize(text, max_sent=3):
    sents = [s for s in re.split(r'(?<=[.!?])\s+', text) if 20<=len(s)<=400]
    if not sents: return ""
    if len(sents) <= max_sent: return " ".join(sents)
    V = TfidfVectorizer(stop_words="english", max_features=6000); S = V.fit_transform(sents)
    score = (S.power(2).sum(axis=1)).A.ravel(); ix=np.argsort(score)[::-1][:max_sent]; ix.sort()
    return " ".join([sents[i] for i in ix])

# -------------------- Fetch --------------------
A, B = fetch(URL_A), fetch(URL_B)

In [36]:
# -------------------- Q1: Similarity --------------------
cos, kwA, kwB = tfidf_cosine_and_keywords(A, B, k=5)
topic = "High" if cos>=.50 else "Moderate" if cos>=.25 else "Low"
print("="*80)
print("Q1) Similarity")
print("-"*80)
print(f"Chars A: {len(A):,} | Chars B: {len(B):,}")
print(f"Cosine similarity (TF-IDF 1–2 grams): {cos:.3f}")
print(f"Topic overlap (heuristic): {topic}")

Q1) Similarity
--------------------------------------------------------------------------------
Chars A: 13,050 | Chars B: 10,357
Cosine similarity (TF-IDF 1–2 grams): 0.063
Topic overlap (heuristic): Low


In [37]:
# -------------------- Q2: Sentiment --------------------
SIA = SentimentIntensityAnalyzer()
lab = lambda c: "positive" if c>.05 else "negative" if c<-.05 else "neutral"
sA, sB = SIA.polarity_scores(A), SIA.polarity_scores(B)
sA["label"], sB["label"] = lab(sA["compound"]), lab(sB["compound"])
print("\n"+"="*80)
print("Q2) Sentiment")
print("-"*80)
print(f"A: {sA['label']} (compound={sA['compound']:.3f})")
print(f"B: {sB['label']} (compound={sB['compound']:.3f})")
print("Same label?", "Yes" if sA["label"]==sB["label"] else "No")


Q2) Sentiment
--------------------------------------------------------------------------------
A: positive (compound=0.998)
B: negative (compound=-0.880)
Same label? No


In [38]:
# -------------------- Q3: Keywords --------------------
print("\n"+"="*80)
print("Q3) Top 5 keywords (per article)")
print("-"*80)
print("A:", kwA)
print("B:", kwB)


Q3) Top 5 keywords (per article)
--------------------------------------------------------------------------------
A: ['nasa', 'read', 'article', 'min read', 'min']
B: ['fox', 'fox news', 'deals', 'news', 'health']


In [39]:
# -------------------- Q4: Summaries + report --------------------
sumA, sumB = summarize(A,3), summarize(B,3)
print("\n"+"="*80)
print("Q4) Summaries (≤3 sentences) + Consolidated findings")
print("-"*80)
print("A:", textwrap.fill(sumA, 100))
print("\nB:", textwrap.fill(sumB, 100))
overall = (f"Overall, cosine={cos:.3f} → {topic} topical overlap. "
           f"Sentiment labels are {'the same' if sA['label']==sB['label'] else 'different'} "
           f"({sA['label']} vs {sB['label']}).")
print("\nConsolidated Findings:")
print(textwrap.fill(overall, 100))


Q4) Summaries (≤3 sentences) + Consolidated findings
--------------------------------------------------------------------------------
A: article 1 month ago 6 min read Background article 1 month ago 3 min read What’s Up: October 2025
Skywatching Tips from NASA article 1 month ago Highlights 4 min read Exoplanet Watch Overview
article 1 month ago 6 min read Background article 1 month ago 5 min read Discovery Alert: ‘Baby’
Planet Photographed in a Ring around a Star for the First Time! Read More Images of Change Before-
and-after images of Earth. NASA Earth Exchange (NEX) NEX combines state-of-the-art supercomputing,
Earth system modeling, and NASA remote sensing data feeds to deliver a work environment for
exploring and analyzing terabyte- to petabyte-scale datasets covering large regions, continents or
the globe.

B: The Latest Science News Today | Fox News Fox News Media Fox News Media Fox Business Fox Nation Fox
News Audio Fox Weather Outkick Fox Noticias Books Fox News U.S. All rig

NOTE - I have used other URLs reated to similar topics for both the given URLs as the given URL links were expired.