## DATA 622 Natural Language Processing
### Homework 10

Questions

Use the following article: https://www.frontiersin.org/journals/sustainablecities/articles/10.3389/frsc.2023.1308684/full
1. Identify the topics using LDA and LSI.
2. Provide a hierarchical clustering diagram to show the major clusters of topics.
3. Use an LLM and transformers to answer these questions and report the top five keywords.

In [11]:
%pip install -qU "requests==2.32.4" beautifulsoup4 scikit-learn scipy matplotlib
%pip install -q --index-url https://download.pytorch.org/whl/cpu torch==2.8.0+cpu
%pip install -q "transformers==4.45.2"

In [12]:
import re, textwrap, numpy as np, requests
from bs4 import BeautifulSoup

URL_INPUT = "https://www.frontiersin.org/journals/sustainablecities/articles/10.3389/frsc.2023.1308684/full"

def canonical_frontiers_url(url: str) -> str:
    m = re.search(r"(10\.3389/[a-z]+\.20\d{2}\.\d+)/full", url)
    return f"https://www.frontiersin.org/articles/{m.group(1)}/full" if m else url

def fetch_clean(url: str) -> str:
    headers = {"User-Agent":"Mozilla/5.0", "Accept":"text/html,application/xhtml+xml"}
    tried=[]
    for candidate in [url, canonical_frontiers_url(url)]:
        if candidate in tried: continue
        tried.append(candidate)
        r = requests.get(candidate, headers=headers, timeout=30, allow_redirects=True)
        if r.status_code==200 and ("<html" in r.text.lower() or "<!doctype" in r.text.lower()):
            soup = BeautifulSoup(r.text, "html.parser")
            for bad in soup(["script","style","noscript"]): bad.decompose()
            txt = " ".join(soup.get_text(" ").split())
            return re.sub(r"\s+"," ", txt).strip()
    raise RuntimeError(f"Unable to fetch article. Tried: {tried}")

doc = fetch_clean(URL_INPUT)

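def summarize3(text: str) -> str:
    """Return the three highest-scoring sentences, kept in document order."""
    from sklearn.feature_extraction.text import TfidfVectorizer
    sents = [s for s in re.split(r'(?<=[.!?])\s+', text) if 40 <= len(s) <= 500]
    if not sents:
        return ""
    if len(sents) <= 3:
        return " ".join(sents)
    # norm=None keeps raw TF-IDF weights; with the default L2 norm every
    # sentence row has unit length, so the squared sums below would all be 1.0
    vec = TfidfVectorizer(stop_words='english', max_features=8000, norm=None)
    S = vec.fit_transform(sents)
    scores = np.asarray(S.power(2).sum(axis=1)).ravel()
    top = np.sort(np.argsort(scores)[::-1][:3])  # top 3, back in document order
    return " ".join(sents[i] for i in top)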

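# Split into paragraph "documents". Caveat: fetch_clean() has already collapsed
# every whitespace run to a single space, so neither pattern below can match
# and the whole article comes back as one chunk.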
paras = [p.strip() for p in re.split(r'\n{2,}|(?<=\.)\s{2,}', doc)]
doc_chunks = [p for p in paras if len(p) >= 300]

print(f"Fetched chars: {len(doc):,} | chunks used: {len(doc_chunks)}")


Fetched chars: 136,534 | chunks used: 1
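
The single chunk reported above follows from `fetch_clean`, which collapses every run of whitespace to one space before the paragraph split runs, so no blank lines or double spaces are left to split on. One possible workaround, sketched here on the assumption that fixed-size sentence windows are an acceptable stand-in for paragraphs, is to group consecutive sentences. The helper name `sentence_windows`, the window size of 8, and the toy input are illustrative choices, not part of the original notebook.

```python
import re

def sentence_windows(text: str, size: int = 8, min_chars: int = 300) -> list[str]:
    """Group consecutive sentences into pseudo-paragraphs of `size` sentences.

    Hypothetical helper: the window size and minimum length are assumptions
    and would need tuning for a real article.
    """
    # Naive split on sentence-ending punctuation followed by whitespace
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    windows = [" ".join(sents[i:i + size]) for i in range(0, len(sents), size)]
    # Drop very short windows (for example a trailing fragment)
    return [w for w in windows if len(w) >= min_chars]

# Toy input standing in for the whitespace-collapsed article text
demo = " ".join(
    f"Sentence number {i} talks about urban climate policy in detail."
    for i in range(40)
)
demo_chunks = sentence_windows(demo, size=8)
print(len(demo_chunks))
```

With the real text, `doc_chunks = sentence_windows(doc)` could replace the paragraph split, which would give the later cells more than one document to work with.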


In [14]:
# === Q1 (robust): Topic Modeling — LDA & LSI ================================
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD

print("\nQ1) Topic Modeling — LDA & LSI (adaptive df thresholds)")

n_docs = len(doc_chunks)
print(f"Docs (chunks): {n_docs}")

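# Caveat: merging pairs roughly halves the chunk count, so this branch makes a
# small corpus smaller rather than larger.
if n_docs < 12: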
    merged = []
    buf = []
    for i, p in enumerate(doc_chunks, 1):
        buf.append(p)
        if i % 2 == 0:               # merge every 2 paras
            merged.append(" ".join(buf)); buf=[]
    if buf: merged.append(" ".join(buf))
    doc_chunks = merged
    n_docs = len(doc_chunks)
    print(f"Docs after merging pairs: {n_docs}")

# Set df thresholds safely for small corpora
min_df_val = 1 if n_docs < 30 else 2
max_df_val = 1.0 if n_docs < 50 else 0.95
ngram = (1, 2)

def fit_bow(chunks):
    try:
        bow = CountVectorizer(stop_words="english", min_df=min_df_val, max_df=max_df_val, ngram_range=ngram)
        X = bow.fit_transform(chunks)
        return bow, X
    except ValueError:
        # Relax further on failure
        bow = CountVectorizer(stop_words="english", min_df=1, max_df=1.0, ngram_range=(1,1))
        X = bow.fit_transform(chunks)
        return bow, X

def fit_tfidf(chunks):
    try:
        tfidf = TfidfVectorizer(stop_words="english", min_df=min_df_val, max_df=max_df_val, ngram_range=ngram)
        X = tfidf.fit_transform(chunks)
        return tfidf, X
    except ValueError:
        tfidf = TfidfVectorizer(stop_words="english", min_df=1, max_df=1.0, ngram_range=(1,1))
        X = tfidf.fit_transform(chunks)
        return tfidf, X

bow, X_bow = fit_bow(doc_chunks)
tfidf, X_tfidf = fit_tfidf(doc_chunks)

print(f"Vocab sizes — BoW: {len(bow.get_feature_names_out())}, TFIDF: {len(tfidf.get_feature_names_out())}")

if X_bow.shape[1] < 10 or X_tfidf.shape[1] < 10:
    print("⚠️ Very small vocabulary after filtering. Topics may be noisy.")

# Choose between 2 and 6 topics; the floor of 2 can exceed the document count when n_docs < 2
n_topics = min(6, max(2, n_docs // 3))
n_components = n_topics

# ----- LDA -----
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, learning_method="batch")
lda_W = lda.fit_transform(X_bow)
lda_H = lda.components_
idx2term_bow = np.array(bow.get_feature_names_out())

def top_words(H, idx2term, k=10):
    tops=[]
    for t in range(H.shape[0]):
        ids = np.argsort(H[t])[::-1][:k]
        tops.append(idx2term[ids].tolist())
    return tops

lda_topics = top_words(lda_H, idx2term_bow, k=10)
print(f"\nLDA topics (k={n_topics})")
for i, words in enumerate(lda_topics, 1):
    print(f"  T{i}: {', '.join(words)}")

# ----- LSI (SVD on TF-IDF) -----
svd = TruncatedSVD(n_components=n_components, random_state=42)
lsi_W = svd.fit_transform(X_tfidf)
lsi_H = svd.components_
idx2term_tfidf = np.array(tfidf.get_feature_names_out())
lsi_topics = top_words(lsi_H, idx2term_tfidf, k=10)

print(f"\nLSI topics (k={n_components})")
for i, words in enumerate(lsi_topics, 1):
    print(f"  S{i}: {', '.join(words)}")



Q1) Topic Modeling — LDA & LSI (adaptive df thresholds)
Docs (chunks): 1
Docs after merging pairs: 1
Vocab sizes — BoW: 14429, TFIDF: 14429

LDA topics (k=2)
  T1: climate, change, climate change, google, scholar, google scholar, india, et, et al, al
  T2: climate, change, climate change, google, scholar, google scholar, india, et, et al, al

LSI topics (k=2)
  S1: climate, change, climate change, google, scholar, google scholar, india, al, et al, et


  self.explained_variance_ratio_ = exp_var / full_var
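
Both LDA topics above are identical because the model saw only one document, and terms such as "google scholar" and "et al" suggest that the scraped reference list dominates the counts. A minimal sketch of one way to reduce that noise before vectorizing is shown below. The function name `strip_citation_noise` is hypothetical, and of the link labels only "Google Scholar" is confirmed by the output; the other two are assumptions about what the scraped page contains.

```python
import re

# Link labels assumed to appear in the scraped reference list; only
# "Google Scholar" is confirmed by the topic output, the rest are guesses.
CITATION_LABELS = ["Google Scholar", "CrossRef Full Text", "PubMed Abstract"]

def strip_citation_noise(text: str) -> str:
    """Remove citation link labels and 'et al.' markers from scraped text."""
    for label in CITATION_LABELS:
        text = re.sub(re.escape(label), " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\bet al\b\.?", " ", text, flags=re.IGNORECASE)
    # Tidy leftover separators and whitespace
    text = re.sub(r"\s*\|\s*", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = (
    "Heat stress is rising (Kumar et al., 2020). "
    "Kumar, A. (2020). Urban heat. Google Scholar"
)
cleaned = strip_citation_noise(sample)
print(cleaned)
```

This only removes boilerplate tokens. It does not address the single-document problem, which still needs a corpus of several chunks before LDA or LSI can separate topics.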


In [16]:
# Q2 — Robust Hierarchical Clustering (handles small/edge cases)
import numpy as np, matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, dendrogram

print("\nQ2) Hierarchical Clustering (robust)")

# 1) Start from the chunks built in Q1
chunks = doc_chunks[:]  # assumes doc_chunks exists from earlier cell

# 2) Merge neighbors into bigger documents. Caveat: merging can only reduce the
#    count, so a single chunk stays a single chunk and the check below still fails
if len(chunks) < 2:
    merged, buf = [], []
    for i, p in enumerate(chunks, 1):
        buf.append(p)
        if i % 2 == 0:               # merge every 2 paragraphs
            merged.append(" ".join(buf)); buf = []
    if buf: merged.append(" ".join(buf))
    chunks = merged
    print(f"Chunks after merging: {len(chunks)}")

# 3) If still too few, bail gracefully
if len(chunks) < 2:
    print("⚠️ Not enough documents to cluster (need ≥ 2). Skipping dendrogram.")
else:
    # Rebuild TF-IDF with permissive settings so we actually get features
    tfidf2 = TfidfVectorizer(stop_words="english", min_df=1, max_df=1.0, ngram_range=(1,1))
    X = tfidf2.fit_transform(chunks)

    if X.shape[0] < 2 or X.shape[1] == 0:
        print("⚠️ Not enough signal to cluster (empty or 1×N matrix). Skipping dendrogram.")
    else:
        D = cosine_distances(X)
        condensed = D[np.triu_indices_from(D, k=1)]

        if condensed.size == 0:
            print("⚠️ Distance matrix is empty (all documents identical?). Skipping dendrogram.")
        else:
            Z = linkage(condensed, method="average")
            plt.figure(figsize=(11, 5))
            plt.title("Dendrogram — TF-IDF + Cosine + Average-linkage")
            dendrogram(Z, labels=[f"P{i}" for i in range(len(chunks))], leaf_rotation=90)
            plt.tight_layout()
            plt.savefig("/content/dendrogram.png", dpi=160)
            plt.show()
            print("✅ Saved dendrogram → /content/dendrogram.png")


Q2) Hierarchical Clustering (robust)
Chunks after merging: 1
⚠️ Not enough documents to cluster (need ≥ 2). Skipping dendrogram.
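
The dendrogram was skipped because hierarchical clustering needs at least two observations and only one chunk was available. The sketch below shows the same TF-IDF, cosine distance, and average-linkage recipe on four toy chunks, which are invented text standing in for real article chunks. It uses `pdist` to build the condensed distance vector directly and `no_plot=True` so that it runs without a display backend; both are choices made for this illustration rather than the notebook's own approach.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy chunks standing in for real article chunks (illustrative text only)
toy_chunks = [
    "urban heat islands raise city temperatures during summer heat waves",
    "heat waves and urban temperatures increase energy demand in the city",
    "flood risk grows as monsoon rainfall overwhelms drainage systems",
    "drainage systems fail when monsoon rainfall causes urban flood events",
]

X = TfidfVectorizer(stop_words="english").fit_transform(toy_chunks)
# pdist returns the condensed distance vector that linkage expects
condensed = pdist(X.toarray(), metric="cosine")
Z = linkage(condensed, method="average")

# no_plot=True returns the tree layout (including leaf order) without drawing
tree = dendrogram(Z, labels=[f"P{i}" for i in range(len(toy_chunks))], no_plot=True)
print(tree["ivl"])
```

Dropping `no_plot=True` and calling `plt.show()` afterwards would draw the figure, as in the cell above.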


In [17]:
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

print("\nQ3) LLM (zero-shot) + Top 5 Keywords")

# Zero-shot (small MNLI model for CPU speed)
zs = pipeline("zero-shot-classification", model="typeform/distilbert-base-uncased-mnli", device=-1)
labels = [
    "sustainable cities","urban planning","transportation","energy",
    "climate change","waste management","water management",
    "green infrastructure","policy and governance","community engagement",
    "technology and innovation","housing","air quality"
]
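# Only the first 2,000 characters of the scraped page are classified
snippet = doc[:2000]
# multi_label=True scores each label independently (scores need not sum to 1)
llm_out = zs(snippet, labels, multi_label=True)
llm_scores = {lab: float(sc) for lab, sc in zip(llm_out["labels"], llm_out["scores"])}
# Rescale the independent scores into shares that sum to 100%
total = sum(llm_scores.values()) or 1e-9
llm_pct = {k: 100*(v/total) for k,v in llm_scores.items()}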
print("LLM topic emphasis (%):")
for lab in labels: print(f"  {lab:>24s}: {llm_pct.get(lab,0):6.2f}%")

# Top-5 keywords (global TF-IDF on full article)
tfidf_all = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=8000)
vocab = np.array(tfidf_all.fit([doc]).get_feature_names_out())
weights = tfidf_all.transform([doc]).toarray().ravel()
top_idx = np.argsort(weights)[::-1][:5]
top_keywords = vocab[top_idx].tolist()
print("\nTop 5 keywords:", top_keywords)

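# Brief summary + save report
import os

summary3 = summarize3(doc)
dendro_path = "/content/dendrogram.png"
# Only claim the dendrogram exists if Q2 actually wrote the file
dendro_note = (f"Dendrogram saved at: `{dendro_path}`" if os.path.exists(dendro_path)
               else "Dendrogram not generated (clustering was skipped).")
report = f"""# Assignment — Topics & LLM on Frontiers Article

**URL (input):** {URL_INPUT}

## Q1) Topics
**LDA (top words):**
{chr(10).join([f"- T{i+1}: " + ", ".join(ws) for i, ws in enumerate(lda_topics)])}
**LSI (top words):**
{chr(10).join([f"- S{i+1}: " + ", ".join(ws) for i, ws in enumerate(lsi_topics)])}

## Q2) Hierarchical Clustering
{dendro_note}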

## Q3) LLM (zero-shot) + Keywords
{chr(10).join([f"- {lab}: {llm_pct.get(lab,0):.2f}%" for lab in labels])}

**Top 5 keywords:** {", ".join(top_keywords)}
**Brief summary:** {summary3}
"""
with open("/content/frontiers_assignment_report.md","w",encoding="utf-8") as f:
    f.write(report)
print("\nSaved report → /content/frontiers_assignment_report.md")


Q3) LLM (zero-shot) + Top 5 Keywords


LLM topic emphasis (%):
        sustainable cities:  12.22%
            urban planning:   5.52%
            transportation:   5.82%
                    energy:   8.88%
            climate change:  11.28%
          waste management:   2.06%
          water management:   2.47%
      green infrastructure:  12.47%
     policy and governance:  12.89%
      community engagement:   8.83%
  technology and innovation:  11.45%
                   housing:   3.53%
               air quality:   2.58%

Top 5 keywords: ['climate', 'change', 'climate change', 'google', 'google scholar']

Saved report → /content/frontiers_assignment_report.md
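
The keyword list above comes from a TF-IDF model fit on a single document. In that setting every term has the same IDF, so the ranking reduces to term frequency and reference-list residue such as "google" rises to the top. The sketch below ranks unigrams by raw frequency with an extended stop list. The function name `top_keywords_tf` and both stop lists are illustrative assumptions, and the sample string is invented.

```python
import re
from collections import Counter

# Small illustrative stop list; a real run would use a fuller list such as
# scikit-learn's built-in English stop words.
BASIC_STOP = {"the", "and", "of", "in", "to", "a", "is", "for", "on", "with", "are"}
# Citation residue seen in the output above
CITATION_STOP = {"google", "scholar", "et", "al"}

def top_keywords_tf(text: str, k: int = 5) -> list[str]:
    """Rank unigrams by raw frequency after removing stop words."""
    stop = BASIC_STOP | CITATION_STOP
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    counts = Counter(t for t in tokens if t not in stop)
    return [word for word, _ in counts.most_common(k)]

sample = (
    "Climate adaptation in cities. Climate policy and adaptation planning. "
    "Climate risk. Google Scholar Google Scholar Google Scholar Google Scholar"
)
print(top_keywords_tf(sample, k=2))  # → ['climate', 'adaptation']
```

Fitting TF-IDF on several chunks instead of one document would also restore a meaningful IDF term, at which point the stop list matters less.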
