## DATA 622 Natural Language Processing
### Homework 7

Questions

Extract the article about Sam Altman’s interview:
https://venturebeat.com/ai/sam-altman-at-ted-2025-inside-the-most-uncomfortable-and-important-ai-interview-of-the-year/

Instructions:
1. Use an LLM to summarize the article.
2. Identify the key topics and the sentiment. Is the sentiment measured by the LLM
different from that of a classification model such as Naïve Bayes or a Support
Vector Machine?
3. What is the general emotion of the article?
4. What is the main theme of the article?

## 1. Article Summarizer

In [33]:
# Colab setup
!pip -q install requests beautifulsoup4 transformers torch sentencepiece

import re, time, requests
from bs4 import BeautifulSoup
from transformers import pipeline


In [36]:
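import sys, subprocess, importlib.util

def pip_install(pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "-q", "install", *pkgs])

# Map pip package names to import names (they differ for beautifulsoup4 -> bs4).
# importlib.util.find_spec replaces the deprecated pkgutil.find_loader.
PKG_IMPORTS = {"requests": "requests", "beautifulsoup4": "bs4", "transformers": "transformers",
               "torch": "torch", "sentencepiece": "sentencepiece"}
need = [pkg for pkg, mod in PKG_IMPORTS.items() if importlib.util.find_spec(mod) is None]
if need:
    pip_install(need)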

# 1) Imports
import re, time, requests
from bs4 import BeautifulSoup
from transformers import pipeline

# 2) URL Fetching
URL = "https://venturebeat.com/ai/sam-altman-at-ted-2025-inside-the-most-uncomfortable-and-important-ai-interview-of-the-year/"

def clean_url(u: str) -> str:
    # fixes accidental spaces like "...at- ted-2025..."
    return u.strip().replace(" ", "")

def fetch_with_backoff(url: str, max_tries: int = 4) -> str:
    url = clean_url(url)
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9",
        "Cache-Control": "no-cache",
    }
    for i in range(max_tries):
        try:
            r = requests.get(url, headers=headers, timeout=40, allow_redirects=True)
            if r.status_code == 200 and r.text.strip():
                return r.text
            time.sleep(min(2**i, 8))
        except Exception:
            time.sleep(min(2**i, 8))
    # Fallback plaintext mirror
    fb = "https://r.jina.ai/http://" + clean_url(url).replace("https://", "").replace("http://", "")
    r2 = requests.get(fb, timeout=40)
    r2.raise_for_status()
    return r2.text

def extract_text(html_or_plain: str) -> str:
    first = html_or_plain[:800].lower()
    if "<html" in first or "<!doctype" in first:
        soup = BeautifulSoup(html_or_plain, "html.parser")
        art = soup.find("article")
        if art:
            parts = [t.get_text(" ", strip=True) for t in art.find_all(["h1","h2","h3","p","li"])]
            text = "\n".join(p for p in parts if p)
        else:
            best = ""
            for c in soup.find_all(["main","section","div"]):
                txt = " ".join(p.get_text(" ", strip=True) for p in c.find_all("p"))
                if len(txt) > len(best): best = txt
            text = best or soup.get_text(" ", strip=True)
    else:
        text = html_or_plain
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

# 3) Summarizer
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def chunk(text, max_chars=2800):
    buf, cur = [], []
    for para in text.split("\n"):
        nxt = ("\n".join(cur) + ("\n" if cur else "") + para).strip()
        if len(nxt) > max_chars and cur:
            buf.append("\n".join(cur)); cur = [para]
        else:
            cur.append(para)
    if cur: buf.append("\n".join(cur))
    return buf

# 4) Run
raw = fetch_with_backoff(URL)
article_text = extract_text(raw)
if len(article_text) < 200:
    raise RuntimeError("Extracted text seems too short. Try re-running once or verify the URL.")

parts = chunk(article_text, max_chars=2800) or [article_text]
outs = summarizer(parts, max_length=180, min_length=90, do_sample=False)
final_summary = " ".join(o["summary_text"].strip() for o in outs)

print("=== LLM Summary (free Hugging Face model) ===\n")
print(final_summary)


Device set to use cpu
Your max_length is set to 180, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


=== LLM Summary (free Hugging Face model) ===

Title: Vercel Security Checkpoint. The interview was part of the most-uncomfortable and important-and-important-ai-interview-of-the-year . URL: Sam Altman atted-2025 inside the most uncomfortable and important interview of the year . URL returned error 429: Too Many Requests. The URL was sent to: http://venturebeat.com/ai//samaltman-at-ted2025 .
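
The summary above describes a "Vercel Security Checkpoint" page and a 429 "Too Many Requests" error rather than the interview, which suggests the fetch returned a block page instead of the article. A minimal sketch of a guard that could run on `article_text` before summarizing is shown below. The helper name `looks_like_block_page` and the marker strings are assumptions taken from the output above, not part of the original notebook, and other sites may use different wording.

```python
# Hypothetical guard: flag text that looks like an anti-bot or rate-limit page.
# The marker strings are assumptions drawn from the summary output above.
BLOCK_MARKERS = ("security checkpoint", "too many requests", "error 429")

def looks_like_block_page(text: str, markers=BLOCK_MARKERS, head_chars: int = 2000) -> bool:
    """Return True if the first `head_chars` characters contain a block-page marker."""
    head = text[:head_chars].lower()
    return any(marker in head for marker in markers)

sample_blocked = "Title: Vercel Security Checkpoint\nURL returned error 429: Too Many Requests"
sample_article = "Sam Altman spoke at TED 2025 about AI safety and creator rights."

print(looks_like_block_page(sample_blocked))   # True
print(looks_like_block_page(sample_article))   # False
```

Only the head of the text is checked so that a genuine article that happens to mention rate limiting later on is less likely to be rejected. In the notebook, a check like this could sit next to the existing `len(article_text) < 200` test.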


## 2. Key topics + Sentiment (LLM zero-shot) + Naïve Bayes / SVM comparison

In [42]:
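# Imports this cell relies on (the original cell used these names without importing them)
import re
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from transformers import pipeline

# Ensure article_text exists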
if "article_text" not in globals():
    raw = fetch_with_backoff(URL)
    article_text = extract_text(raw)

# --- Key topics via TF-IDF ---
def tfidf_keywords(text: str, k: int = 10) -> List[str]:
    paras = [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
    if len(paras) < 3: paras = [text]
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=5000)
    X = vec.fit_transform(paras)
    scores = X.mean(axis=0).A1
    idx = scores.argsort()[::-1][:k]
    feats = vec.get_feature_names_out()
    return [feats[i] for i in idx]

key_topics = tfidf_keywords(article_text, k=10)

# --- LLM sentiment (zero-shot on the first ~4000 chars or a summary if you made one) ---
zshot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
target_text = globals().get("final_summary", article_text[:4000])
labels = ["positive","negative","neutral"]
res = zshot(target_text, candidate_labels=labels, multi_label=False)
llm_sent = res["labels"][0]
scores = dict(zip(res["labels"], res["scores"]))

# "mixed" if pos & neg are close/high
if abs(scores.get("positive",0)-scores.get("negative",0)) <= 0.10 and max(scores.get("positive",0),scores.get("negative",0)) >= 0.30:
    llm_sent = "mixed"

# --- Classic toy models (NB & SVM) for comparison ---
POS = ["love fantastic great progress optimistic","results impressive wonderful",
       "very satisfied happy","strong performance optimistic future","inspiring positive achievement",
       "exceeded expectations","promising step forward bright future","exciting opportunities growth"]
NEG = ["poor disappointing underperforms","terrible unhappy concerned",
       "serious risks vague answers uneasy","bad outcome doubts skepticism",
       "failed expectations problems","worried about safety accountability",
       "evasive responses tension","anxious unresolved questions"]
X = POS + NEG
y = ["positive"]*len(POS) + ["negative"]*len(NEG)
nb  = make_pipeline(TfidfVectorizer(ngram_range=(1,2), min_df=1), MultinomialNB()).fit(X,y)
svm = make_pipeline(TfidfVectorizer(ngram_range=(1,2), min_df=1), LinearSVC()).fit(X,y)

nb_pred  = nb.predict([article_text])[0]
svm_pred = svm.predict([article_text])[0]

print("Key topics & Sentiment")
print("   • Key topics are", ", ".join(key_topics))
print(f"   • Sentiment (LLM zero-shot): {llm_sent}  | scores={ {k: round(v,3) for k,v in scores.items()} }")
print(f"   • Naïve Bayes (toy): {nb_pred}")
print(f"   • Linear SVM (toy): {svm_pred}")
print("   • Any Difference? ->", "Yes, likely" if (llm_sent == "mixed" or nb_pred != svm_pred or llm_sent != nb_pred) else "No, broadly consistent")


Device set to use cpu


Key topics & Sentiment
   • Key topics are vercel, vercel security, security checkpoint, security, checkpoint, tecvalqetwe5bxon5qqevm6m19ptfe49, pdx1 1761091978, 1761091978 tecvalqetwe5bxon5qqevm6m19ptfe49, pdx1, 1761091978
   • Sentiment (LLM zero-shot): negative  | scores={'negative': 0.947, 'neutral': 0.041, 'positive': 0.012}
   • Naïve Bayes (toy): negative
   • Linear SVM (toy): negative
   • Any Difference? -> No, broadly consistent
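
The "mixed" rule in the cell above can be read in isolation: the label becomes "mixed" only when the positive and negative scores are within 0.10 of each other and the larger of the two is at least 0.30; otherwise the top-scoring label wins. The sketch below restates that rule as a standalone function. The name `sentiment_label` and the second score dictionary are illustrative assumptions; the first dictionary is the one printed above.

```python
def sentiment_label(scores: dict, margin: float = 0.10, floor: float = 0.30) -> str:
    """Apply the notebook's 'mixed' rule to a {label: score} dictionary."""
    pos = scores.get("positive", 0.0)
    neg = scores.get("negative", 0.0)
    # "mixed" only if positive and negative are close AND at least one is substantial
    if abs(pos - neg) <= margin and max(pos, neg) >= floor:
        return "mixed"
    # Otherwise fall back to the highest-scoring label
    return max(scores, key=scores.get)

reported = {"negative": 0.947, "neutral": 0.041, "positive": 0.012}   # from the output above
hypothetical = {"positive": 0.45, "negative": 0.40, "neutral": 0.15}  # made-up example

print(sentiment_label(reported))      # negative
print(sentiment_label(hypothetical))  # mixed
```

With the reported scores the gap between positive and negative is 0.935, so the rule does not fire and the label stays "negative".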


## 3. General emotion (LLM zero-shot)

In [41]:
import re, time, requests
from bs4 import BeautifulSoup
from transformers import pipeline

URL = "https://venturebeat.com/ai/sam-altman-at-ted-2025-inside-the-most-uncomfortable-and-important-ai-interview-of-the-year/"

def clean_url(u: str) -> str: return u.strip().replace(" ", "")
def fetch_with_backoff(url: str, max_tries: int = 4) -> str:
    url = clean_url(url)
    headers = {"User-Agent":"Mozilla/5.0","Accept-Language":"en-US,en;q=0.9"}
    for i in range(max_tries):
        try:
            r = requests.get(url, headers=headers, timeout=40, allow_redirects=True)
            if r.status_code == 200 and r.text.strip(): return r.text
            time.sleep(min(2**i, 8))
        except Exception: time.sleep(min(2**i, 8))
    fb = "https://r.jina.ai/http://" + clean_url(url).replace("https://","").replace("http://","")
    r2 = requests.get(fb, timeout=40); r2.raise_for_status(); return r2.text

def extract_text(html_or_plain: str) -> str:
    first = html_or_plain[:800].lower()
    if "<html" in first or "<!doctype" in first:
        soup = BeautifulSoup(html_or_plain, "html.parser")
        art = soup.find("article")
        if art:
            text = "\n".join(t.get_text(" ", strip=True) for t in art.find_all(["h1","h2","h3","p","li"]))
        else:
            best = ""
            for c in soup.find_all(["main","section","div"]):
                txt = " ".join(p.get_text(" ", strip=True) for p in c.find_all("p"))
                if len(txt) > len(best): best = txt
            text = best or soup.get_text(" ", strip=True)
    else:
        text = html_or_plain
    return re.sub(r"[ \t]{2,}", " ", re.sub(r"\n{3,}", "\n\n", text)).strip()

# Ensure article_text exists
if "article_text" not in globals():
    raw = fetch_with_backoff(URL); article_text = extract_text(raw)

zshot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
emotion_labels = ["optimism","excitement","tension","anxiety","urgency","skepticism","neutrality","hope"]
target = globals().get("final_summary", article_text[:4000])
emo_res = zshot(target, candidate_labels=emotion_labels, multi_label=False)

print("General emotion is -")
print("   •", emo_res["labels"][0])
print("   • scores:", {k: round(v,3) for k,v in zip(emo_res["labels"], emo_res["scores"])})

Device set to use cpu


General emotion is -
   • tension
   • scores: {'tension': 0.544, 'anxiety': 0.162, 'urgency': 0.149, 'skepticism': 0.057, 'excitement': 0.042, 'neutrality': 0.021, 'hope': 0.019, 'optimism': 0.007}
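
Because `multi_label=False` makes the scores sum to roughly 1, the gap between the top label and the runner-up gives a rough sense of how decisive the classification is. The sketch below computes that gap from the scores printed above. The helper name `top_with_margin` is an assumption, and the numbers only describe whatever text was passed to the classifier in this run.

```python
def top_with_margin(scores: dict):
    """Return (top label, runner-up label, score gap rounded to 3 decimals)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, s1), (second, s2) = ranked[0], ranked[1]
    return top, second, round(s1 - s2, 3)

# Scores copied from the cell output above
emotion_scores = {
    "tension": 0.544, "anxiety": 0.162, "urgency": 0.149, "skepticism": 0.057,
    "excitement": 0.042, "neutrality": 0.021, "hope": 0.019, "optimism": 0.007,
}

print(top_with_margin(emotion_scores))  # ('tension', 'anxiety', 0.382)
```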


## 4. Main Theme (LLM zero-shot with candidate themes + keyword nudge)

In [40]:
import re, time, requests
from bs4 import BeautifulSoup
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

URL = "https://venturebeat.com/ai/sam-altman-at-ted-2025-inside-the-most-uncomfortable-and-important-ai-interview-of-the-year/"

def clean_url(u: str) -> str: return u.strip().replace(" ", "")
def fetch_with_backoff(url: str, max_tries: int = 4) -> str:
    url = clean_url(url)
    headers = {"User-Agent":"Mozilla/5.0","Accept-Language":"en-US,en;q=0.9"}
    for i in range(max_tries):
        try:
            r = requests.get(url, headers=headers, timeout=40, allow_redirects=True)
            if r.status_code == 200 and r.text.strip(): return r.text
            time.sleep(min(2**i, 8))
        except Exception: time.sleep(min(2**i, 8))
    fb = "https://r.jina.ai/http://" + clean_url(url).replace("https://","").replace("http://","")
    r2 = requests.get(fb, timeout=40); r2.raise_for_status(); return r2.text

def extract_text(html_or_plain: str) -> str:
    first = html_or_plain[:800].lower()
    if "<html" in first or "<!doctype" in first:
        soup = BeautifulSoup(html_or_plain, "html.parser")
        art = soup.find("article")
        if art:
            text = "\n".join(t.get_text(" ", strip=True) for t in art.find_all(["h1","h2","h3","p","li"]))
        else:
            best = ""
            for c in soup.find_all(["main","section","div"]):
                txt = " ".join(p.get_text(" ", strip=True) for p in c.find_all("p"))
                if len(txt) > len(best): best = txt
            text = best or soup.get_text(" ", strip=True)
    else:
        text = html_or_plain
    return re.sub(r"[ \t]{2,}", " ", re.sub(r"\n{3,}", "\n\n", text)).strip()

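def tfidf_keywords(text: str, k: int = 12):
    # Fitted on a single document, the IDF term is constant, so this ranks terms by normalized frequency.
    vec = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_features=5000)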
    X = vec.fit_transform([text])
    feats = vec.get_feature_names_out()
    idx = X.toarray()[0].argsort()[::-1][:k]
    return [feats[i] for i in idx]

# Ensure article_text exists
if "article_text" not in globals():
    raw = fetch_with_backoff(URL); article_text = extract_text(raw)

zshot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
theme_labels = [
    "Balancing rapid AI growth with safety and governance",
    "Compensating artists and creator rights in AI",
    "Compute shortages, business model and funding scale",
    "Agentic AI risks and preparedness/guardrails",
]
target = globals().get("final_summary", article_text[:4000])
res = zshot(target, candidate_labels=theme_labels, multi_label=False)
main_theme = res["labels"][0]

# Keyword nudge: if any safety/governance term appears among the top-20 keywords, select the balance theme
kws = " ".join(tfidf_keywords(article_text, k=20)).lower()
if any(w in kws for w in ["safety","risk","guardrail","governance","accountab","agent"]):
    main_theme = "Balancing rapid AI growth with safety and governance"

print("Main theme :", main_theme)

Device set to use cpu


Main theme : Agentic AI risks and preparedness/guardrails
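
The keyword nudge is a substring test: the zero-shot result is overridden as soon as any trigger fragment occurs anywhere in the joined keyword string, so "agent" also matches "agentic" and "accountab" matches "accountability". The sketch below isolates that logic and reports which triggers matched. The function name `nudge_theme` is an assumption, and the keyword list is the ten key topics printed in section 2, used here as a stand-in for the top-20 list the cell actually computes.

```python
TRIGGERS = ("safety", "risk", "guardrail", "governance", "accountab", "agent")
BALANCE_THEME = "Balancing rapid AI growth with safety and governance"

def nudge_theme(model_theme: str, keywords, triggers=TRIGGERS, override=BALANCE_THEME):
    """Return (final theme, matched triggers); override when any trigger is a substring."""
    joined = " ".join(keywords).lower()
    hits = [t for t in triggers if t in joined]
    return (override if hits else model_theme), hits

# Key topics copied from the section 2 output (stand-in for the cell's top-20 list)
reported_keywords = [
    "vercel", "vercel security", "security checkpoint", "security", "checkpoint",
    "tecvalqetwe5bxon5qqevm6m19ptfe49", "pdx1 1761091978",
    "1761091978 tecvalqetwe5bxon5qqevm6m19ptfe49", "pdx1", "1761091978",
]

print(nudge_theme("Agentic AI risks and preparedness/guardrails", reported_keywords))
# ('Agentic AI risks and preparedness/guardrails', [])
```

None of the trigger fragments occur in these keywords, which is consistent with the zero-shot label being printed unchanged above.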
