# Tech Trends News Agent

An AI-powered news analysis system that retrieves technology articles, enables semantic search, generates conversational answers using a local LLM, and visualizes trending topics.

## Quick Start

1. Click the `.ipynb` notebook file in this repository
2. Click **Open in Colab** at the top
3. Set runtime to **GPU** (Runtime → Change runtime type → T4 GPU)
4. Run all cells (Runtime → Run all)
5. Click the ngrok URL when it appears to open the app

---

## Setup and Run Instructions

### 1. Prerequisites

- Google account (for Colab)
- That's it! Everything else runs in the cloud.

---

### 2. Run in Google Colab

1. Click the notebook file (`Copy_of_12_4_New_AI_project.ipynb`) above
2. Click the **Open in Colab** button at the top of the file
3. Go to **Runtime → Change runtime type** and select **GPU (T4)**
4. Run all cells in order (Runtime → Run all)
5. Wait for the ngrok URL to appear (~60 seconds for model loading)
6. Click the ngrok URL to open the Streamlit app

---

### 3. API Key

You **do not need to create your own NewsAPI key**. A working key is hardcoded as `NEWSAPI_KEY` in `app.py`, and the agent uses it automatically when retrieving articles.

---

### 4. Project Structure

```
tech-trends-agent/
├── app.py                 # Streamlit web interface
├── fetch_news.py          # NewsAPI article retrieval
├── preprocess.py          # NLTK tokenization and preprocessing
├── search_articles.py     # TF-IDF search and topic extraction
├── llm_interface.py       # Qwen2.5-1.5B model wrapper
└── data/
    ├── raw/               # Raw JSON from NewsAPI
    └── processed/         # Preprocessed article corpus
```

---

### 5. Using the App

#### Chat Tab
1. Click **"Fetch Articles"** in the sidebar to load the latest technology news
2. Type a question (e.g., "What are the latest AI trends?") or click a suggestion
3. View the LLM-generated answer with expandable source citations

#### Topic Visualization Tab
1. After fetching articles, switch to the **"Topic Visualization"** tab
2. View the bar chart of the top 5 discovered topics
3. Expand each topic to see key terms, sample articles, and sub-trends

---

### 6. Output

- **Chat Responses**: Contextual answers displayed in the chat interface with source links
- **Topic Clusters**: Automatically extracted from the article corpus using TF-IDF and co-occurrence analysis
- **Sub-Trends**: Relevance-scored sub-topics within each main topic cluster
- **Metrics**: Total articles, categorized count, and coverage percentage (summed topic counts divided by total articles; articles matching several topics are counted more than once)
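
Coverage is computed in `app.py` as the sum of per-topic article counts divided by the corpus size. A short sketch with invented numbers shows how the figure is derived and why overlapping topics can push it past 100%. The function name `coverage_percent` is hypothetical.

```python
def coverage_percent(topic_counts, total_articles):
    """Summed topic counts as a percentage of the corpus size."""
    if total_articles == 0:
        return 0.0
    return sum(topic_counts) / total_articles * 100

# Invented counts for a 98-article corpus.
print(f"{coverage_percent([30, 25, 20], 98):.0f}%")
# Overlapping topics: the same article is counted under more than one topic.
print(f"{coverage_percent([60, 50], 98):.0f}%")
```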

---

### 7. Notes for Graders

- **Easiest path**: Run the Colab notebook with GPU runtime enabled
- The API key is pre-configured, so no NewsAPI setup is required
- Model loading takes ~60 seconds on first run (downloading ~3 GB)
- Once loaded, queries are answered in 2-5 seconds
- If NewsAPI rate limits are hit, the app falls back to previously fetched articles cached in `data/processed/`
- The app automatically loads cached articles if available
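
The cache fallback works because the timestamped filenames sort chronologically. A minimal sketch of that lookup follows, assuming the `corpus_YYYYMMDD_HHMMSS.json` naming used by `preprocess.py`. The helper name `latest_corpus_path` is hypothetical and not part of the project.

```python
import os
import tempfile

def latest_corpus_path(processed_dir="data/processed"):
    """Return the newest cached corpus file, or None if nothing is cached.

    Hypothetical helper: relies on names like corpus_20251209_010143.json,
    where lexicographic order equals chronological order.
    """
    if not os.path.isdir(processed_dir):
        return None
    files = sorted(f for f in os.listdir(processed_dir) if f.endswith(".json"))
    return os.path.join(processed_dir, files[-1]) if files else None

# Usage with a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("corpus_20251208_120000.json", "corpus_20251209_010143.json"):
        open(os.path.join(tmp, name), "w").close()
    newest = os.path.basename(latest_corpus_path(tmp))
    print(newest)
```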

#### To Test:
1. Run all cells in the notebook
2. Click "Fetch Articles" to retrieve latest news
3. Ask questions in the chat interface
4. Check the Topic Visualization tab for trend analysis

---

### 8. Technical Details

- **LLM**: Qwen2.5-1.5B-Instruct
- **Search**: TF-IDF with cosine similarity
- **Preprocessing**: NLTK (tokenize, stopwords, lemmatize)
- **Topic Extraction**: Co-occurrence clustering
- **Frontend**: Streamlit
- **Data Source**: NewsAPI (technology headlines)
- **GPU Support**: CUDA (float16 precision)
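
The search technique listed above, TF-IDF scoring with cosine similarity, can be illustrated with a minimal, self-contained sketch. It uses a smoothed IDF, `log((N + 1) / (df + 1)) + 1`. The toy documents and function names are illustrative assumptions rather than the project's own code.

```python
import math
from collections import Counter

def tfidf(tokens, idf):
    """Map each token to (term frequency) * (inverse document frequency)."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus (made up for illustration).
docs = [
    "openai releases new ai model".split(),
    "apple unveils new iphone".split(),
    "ai chips drive nvidia revenue".split(),
]
n = len(docs)
df = Counter(t for d in docs for t in set(d))
idf = {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in df}

vectors = [tfidf(d, idf) for d in docs]
query_vec = tfidf("ai model".split(), idf)
scores = [cosine(query_vec, v) for v in vectors]
ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Documents that share no terms with the query score exactly zero, which is why the search only returns results with a positive score.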

---

### 9. Troubleshooting

- **"No articles loaded"**: Click "Fetch Articles" in the sidebar
- **Model loading slow**: Ensure GPU runtime is enabled in Colab
- **ngrok URL not working**: Re-run the last cell to generate a new tunnel
- **NewsAPI error**: Rate limit reached; wait or use cached data
- **CUDA out of memory**: Restart the runtime and try again

In [None]:
# ===CELL 1: Setup and Install Dependencies===
#@title 1. Check GPU & Install Dependencies
!nvidia-smi
!pip install -q transformers torch accelerate nltk streamlit pyngrok newsapi-python

import os
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)
print("Directories created")




Tue Dec  9 00:58:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   75C    P0             35W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
#@title 2. Create fetch_news.py
%%writefile fetch_news.py
import os
import json
from datetime import datetime, timedelta
from newsapi import NewsApiClient

TECH_DOMAINS = "techcrunch.com,theverge.com,wired.com,arstechnica.com,engadget.com,venturebeat.com,techradar.com,gizmodo.com,reuters.com,forbes.com,zdnet.com,cnet.com,mashable.com,thenextweb.com,businessinsider.com,bloomberg.com,bbc.com,theguardian.com,nytimes.com,washingtonpost.com,apnews.com,cnbc.com,axios.com,protocol.com,techmeme.com,9to5mac.com,9to5google.com,androidcentral.com,macrumors.com,tomshardware.com,pcmag.com,pcworld.com,computerworld.com,infoworld.com,theregister.com,siliconangle.com,techspot.com,digitaltrends.com,tomsguide.com,howtogeek.com,phonearena.com,gsmarena.com"
def fetch_news_articles(api_key: str, from_date: str, to_date: str,
                        query: str = "technology",
                        max_articles: int = 100) -> list:
    newsapi = NewsApiClient(api_key=api_key)
    all_articles = []
    page = 1
    page_size = min(100, max_articles)

    print(f"Fetching articles from {from_date} to {to_date}...")

    while len(all_articles) < max_articles:
        try:
            response = newsapi.get_everything(
                domains=TECH_DOMAINS,
                from_param=from_date,
                to=to_date,
                language='en',
                sort_by='publishedAt',
                page_size=page_size,
                page=page
            )
            articles = response.get('articles', [])
            print(f"  Page {page}: found {len(articles)} articles")
            if not articles:
                break
            all_articles.extend(articles)
            total_results = response.get('totalResults', 0)
            print(f"  Total available: {total_results}")
            if len(all_articles) >= total_results or len(all_articles) >= max_articles:
                break
            page += 1
        except Exception as e:
            print(f"Error fetching page {page}: {e}")
            break

    all_articles = all_articles[:max_articles]
    os.makedirs("data/raw", exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    raw_path = f"data/raw/newsapi_{timestamp}.json"
    with open(raw_path, 'w', encoding='utf-8') as f:
        json.dump(all_articles, f, indent=2, ensure_ascii=False)
    print(f"[OK] Saved {len(all_articles)} articles to {raw_path}")
    return all_articles

def get_date_range_options():
    today = datetime.now()
    return {
        "Last 24 hours": (today - timedelta(days=1), today),
        "Last 3 days": (today - timedelta(days=3), today),
        "Last 7 days": (today - timedelta(days=7), today),
        "Last 14 days": (today - timedelta(days=14), today),
        "Last 30 days": (today - timedelta(days=30), today),
    }

Overwriting fetch_news.py
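
The run log at the end of this notebook shows NewsAPI rejecting page 2 because developer accounts are capped at 100 results. A small sketch of how the number of requests could be bounded up front is given here. Both `pages_to_request` and its `result_cap` parameter are hypothetical and not part of `fetch_news.py`.

```python
import math

def pages_to_request(max_articles, page_size=100, result_cap=100):
    """Number of pages worth requesting before hitting the account cap.

    result_cap=100 mirrors the developer-plan limit reported in the log;
    treat it as an assumption for other plans.
    """
    if max_articles <= 0 or page_size <= 0:
        return 0
    reachable = min(max_articles, result_cap)
    return math.ceil(reachable / page_size)

print(pages_to_request(100))      # a single page of up to 100 results
print(pages_to_request(250, 50))  # the cap limits this to two pages of 50
```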


In [None]:
# ===CELL 3: Create preprocess.py===
#@title 3. Create preprocess.py
%%writefile preprocess.py
import os, re, json
from datetime import datetime
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    if not text: return []
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if re.match(r"[a-zA-Z0-9]+", t)]
    tokens = [t for t in tokens if t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

def preprocess(articles=None):
    if articles is None:
        raw_dir = "data/raw"
        files = sorted([f for f in os.listdir(raw_dir) if f.endswith(".json")], reverse=True)
        if not files: raise FileNotFoundError("No raw files")
        with open(os.path.join(raw_dir, files[0]), "r", encoding="utf-8") as f:
            articles = json.load(f)

    print(f"Processing {len(articles)} articles...")
    processed = []

    for idx, art in enumerate(articles, 1):
        title = art.get("title") or ""
        desc = art.get("description") or ""
        content = re.sub(r"\[\+\d+ chars\]$", "", art.get("content") or "").strip()
        readable_text = f"{title}. {desc}" if desc else title
        tokens = clean_text(f"{title} {desc} {content}")
        source = art.get("source", {})
        source_name = source.get("name", "Unknown") if isinstance(source, dict) else str(source or "Unknown")
        processed.append({
            "id": f"article_{idx}",
            "url": art.get("url", ""),
            "title": title,
            "description": desc,
            "text": readable_text,
            "source": source_name,
            "published": art.get("publishedAt", ""),
            "tokens": tokens
        })

    os.makedirs("data/processed", exist_ok=True)
    out_path = f"data/processed/corpus_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(processed, f, indent=2, ensure_ascii=False)
    print(f"[OK] Saved {len(processed)} articles to {out_path}")
    return processed

if __name__ == "__main__":
    preprocess()

Overwriting preprocess.py
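
NewsAPI truncates the `content` field and appends a marker such as `[+1234 chars]`, which `preprocess()` strips with a regular expression before tokenizing. A stdlib-only sketch of that cleanup step follows. The tiny stopword set and the `simple_clean` name are stand-ins for illustration, since the real pipeline uses NLTK's tokenizer, stopword list, and lemmatizer.

```python
import re

TRUNCATION_MARKER = re.compile(r"\[\+\d+ chars\]$")
TINY_STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}  # illustrative only

def simple_clean(text):
    """Strip the truncation marker, lowercase, and drop stopwords."""
    text = TRUNCATION_MARKER.sub("", text or "").strip()
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in TINY_STOPWORDS]

sample = "The future of AI chips is bright… [+1843 chars]"
print(simple_clean(sample))
```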


In [None]:
#@title 5. Create search_articles.py (with Topic Visualization)
%%writefile search_articles.py
import os, re, math, json
from collections import defaultdict, Counter
from typing import List, Dict, Any, Tuple
import itertools

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-zA-Z0-9']+", text.lower()) if text else []

def compute_tf(tokens):
    if not tokens: return {}
    counts = defaultdict(int)
    for t in tokens: counts[t] += 1
    return {t: c/len(tokens) for t,c in counts.items()}

def compute_df(doc_tokens):
    df = defaultdict(int)
    for tokens in doc_tokens:
        for t in set(tokens): df[t] += 1
    return df

def tfidf_vector(tokens, idf):
    tf = compute_tf(tokens)
    return {t: tf[t] * idf.get(t, 0) for t in tf}

def cosine_similarity(a, b):
    if not a or not b: return 0.0
    shared = set(a.keys()) & set(b.keys())
    if not shared: return 0.0
    dot = sum(a[k]*b[k] for k in shared)
    na, nb = math.sqrt(sum(v*v for v in a.values())), math.sqrt(sum(v*v for v in b.values()))
    return dot/(na*nb) if na and nb else 0.0

def load_processed(processed_dir="data/processed"):
    files = sorted([f for f in os.listdir(processed_dir) if f.endswith(".json")], reverse=True)
    if not files: raise FileNotFoundError("No processed files")
    with open(os.path.join(processed_dir, files[0]), "r", encoding="utf-8") as f:
        return json.load(f)

def search_corpus(query: str, corpus: List[Dict], k: int = 5) -> List[Dict]:
    if not query or not corpus: return []
    query_lower = query.lower()
    skip_words = {"what", "are", "the", "latest", "tell", "me", "about", "is", "happening", "with"}
    query_keywords = [w for w in tokenize(query) if w not in skip_words]
    # Match "ai" as a whole token so words such as "email" do not trigger the expansion.
    if "ai" in tokenize(query) or "artificial intelligence" in query_lower:
        query_keywords = ["ai", "artificial", "intelligence"] + query_keywords
    docs = []
    for d in corpus:
        tokens = d.get("tokens", [])
        title_tokens = tokenize(d.get("title", ""))
        docs.append({"id": d["id"], "title": d.get("title", ""), "text": d.get("text", d.get("description", "")),
            "source": d.get("source", "Unknown"), "published": d.get("published", ""), "url": d.get("url", ""),
            "tokens": tokens + title_tokens * 3})
    doc_tokens = [d["tokens"] for d in docs]
    df = compute_df(doc_tokens)
    n = len(docs)
    idf = {t: math.log((n+1)/(df[t]+1))+1 for t in df}
    doc_vecs = [tfidf_vector(d["tokens"], idf) for d in docs]
    q_vec = tfidf_vector(query_keywords or tokenize(query), idf)
    if not q_vec: return []
    scored = sorted([(cosine_similarity(q_vec, dv), i) for i,dv in enumerate(doc_vecs)], reverse=True)
    results = []
    for score, i in scored:
        if score > 0:
            d = docs[i]
            snippet = d.get("text", "")[:250] if d.get("text") else "(no content)"
            results.append({"id": d["id"], "title": d["title"], "score": round(score, 4), "snippet": snippet,
                "source": d["source"], "published": d["published"], "url": d["url"]})
        if len(results) >= k: break
    return results

# Dynamic Topic Extraction
STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with",
    "by", "from", "as", "is", "was", "are", "were", "been", "be", "have", "has", "had",
    "do", "does", "did", "will", "would", "could", "should", "may", "might", "must",
    "it", "its", "this", "that", "these", "those", "i", "you", "he", "she", "we", "they",
    "what", "which", "who", "whom", "when", "where", "why", "how", "all", "each", "every",
    "both", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only",
    "own", "same", "so", "than", "too", "very", "just", "can", "now", "new", "said",
    "says", "say", "also", "get", "got", "one", "two", "first", "last", "year", "years",
    "time", "way", "even", "back", "after", "use", "make", "made", "want", "see", "look",
    "into", "over", "out", "up", "down", "about", "through", "between", "under", "again",
    "there", "here", "then", "once", "during", "before", "being", "any", "many", "much",
    "while", "against", "part", "based", "using", "used", "according", "around", "well",
    "going", "come", "take", "thing", "things", "really", "still", "since", "without",
    "need", "needs", "like", "know", "think", "work", "working", "works", "people",
    "company", "companies", "percent", "million", "billion", "including", "however",
    "though", "another", "something", "nothing", "everything", "anything", "someone",
    "everyone", "today", "week", "month", "day", "days", "weeks", "months", "report",
    "reported", "reports", "news", "article", "story", "via", "per", "etc", "ie", "eg",
    "read", "best", "good", "better", "available", "right", "left", "big", "small",
    "long", "short", "high", "low", "old", "young", "early", "late", "next", "last"
}

def extract_ngrams(tokens: List[str], n: int = 2) -> List[str]:
    if len(tokens) < n: return []
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def get_document_terms(doc: Dict) -> List[str]:
    # Use the pre-processed tokens and title from the corpus
    title_tokens = tokenize(doc.get("title", ""))
    text_tokens = tokenize(doc.get("text", ""))
    content_tokens = doc.get("tokens", [])

    title_filtered = [t for t in title_tokens if t not in STOPWORDS and len(t) > 2]
    text_filtered = [t for t in text_tokens if t not in STOPWORDS and len(t) > 2]
    content_filtered = [t for t in content_tokens if t not in STOPWORDS and len(t) > 2]

    # Weight title terms more heavily
    terms = title_filtered * 5 + text_filtered * 2 + content_filtered
    title_bigrams = extract_ngrams(title_filtered, 2)
    return terms + title_bigrams

def extract_top_terms(corpus: List[Dict], top_n: int = 50) -> List[Tuple[str, float]]:
    doc_terms = [get_document_terms(doc) for doc in corpus]
    df = defaultdict(int)
    for terms in doc_terms:
        for term in set(terms): df[term] += 1
    n_docs = len(corpus)
    term_scores = defaultdict(float)
    for terms in doc_terms:
        tf = Counter(terms)
        max_tf = max(tf.values()) if tf else 1
        for term, count in tf.items():
            tfidf = (count / max_tf) * math.log((n_docs + 1) / (df[term] + 1))
            term_scores[term] += tfidf
    sorted_terms = sorted(term_scores.items(), key=lambda x: x[1], reverse=True)
    min_docs, max_docs = max(2, n_docs * 0.02), n_docs * 0.7
    filtered = [(t, s) for t, s in sorted_terms if min_docs <= df[t] <= max_docs and len(t) > 2]
    return filtered[:top_n]

def cluster_terms_into_topics(corpus: List[Dict], n_topics: int = 5) -> List[Dict]:
    top_terms = extract_top_terms(corpus, top_n=100)
    term_set = {t for t, _ in top_terms}
    cooccurrence = defaultdict(lambda: defaultdict(int))
    term_to_docs = defaultdict(set)
    for doc_idx, doc in enumerate(corpus):
        doc_terms = set(get_document_terms(doc)) & term_set
        for term in doc_terms: term_to_docs[term].add(doc_idx)
        for t1, t2 in itertools.combinations(list(doc_terms), 2):
            cooccurrence[t1][t2] += 1
            cooccurrence[t2][t1] += 1
    used_terms, topics = set(), []
    for seed_term, seed_score in top_terms:
        if seed_term in used_terms: continue
        related = [(seed_term, seed_score)]
        candidates = sorted(cooccurrence[seed_term].items(), key=lambda x: x[1], reverse=True)
        for term, cooc_count in candidates:
            if term not in used_terms and len(related) < 8 and cooc_count >= 2:
                related.append((term, cooc_count))
        if len(related) >= 2:
            topic_terms = [t for t, _ in related]
            used_terms.update(topic_terms)
            topic_docs = set()
            for term in topic_terms[:3]: topic_docs.update(term_to_docs[term])
            name_parts = [t.title() for t, _ in related[:2]]
            topic_name = " & ".join(name_parts)
            # Get actual article titles for this topic
            sample_articles = [corpus[idx].get("title", "Unknown") for idx in list(topic_docs)[:3]]
            topics.append({"name": topic_name, "terms": topic_terms, "count": len(topic_docs),
                "doc_indices": topic_docs, "sample_articles": sample_articles})
        if len(topics) >= n_topics: break
    topics.sort(key=lambda x: x["count"], reverse=True)
    return topics[:n_topics]

def extract_subtrends(corpus: List[Dict], topic: Dict, n_subtrends: int = 5) -> List[Dict]:
    topic_docs = [corpus[i] for i in topic["doc_indices"] if i < len(corpus)]
    if not topic_docs: return []
    main_terms = set(topic["terms"])
    term_counts, term_docs = Counter(), defaultdict(set)
    for doc_idx, doc in enumerate(topic_docs):
        for term in get_document_terms(doc):
            if term not in main_terms and term not in STOPWORDS and len(term) > 2:
                term_counts[term] += 1
                term_docs[term].add(doc_idx)
    min_docs, max_docs = max(1, len(topic_docs) * 0.1), len(topic_docs) * 0.8
    subtrend_candidates = [(t, c) for t, c in term_counts.most_common(50) if min_docs <= len(term_docs[t]) <= max_docs]
    subtrends, used = [], set()
    for term, count in subtrend_candidates:
        if term in used: continue
        relevance = len(term_docs[term]) / len(topic_docs)
        subtrends.append({"name": term.title(), "relevance": round(min(0.95, relevance + 0.2), 2),
            "article_count": len(term_docs[term])})
        used.add(term)
        if len(subtrends) >= n_subtrends: break
    if len(subtrends) < n_subtrends:
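        for term, _ in term_counts.most_common(20):
            if term not in used and len(subtrends) < n_subtrends:
                # Use document frequency, not the weighted occurrence count, so relevance stays within [0, 1].
                doc_count = len(term_docs[term])
                subtrends.append({"name": term.title(), "relevance": round(0.3 + (doc_count / len(topic_docs)) * 0.4, 2),
                    "article_count": doc_count})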
    return sorted(subtrends, key=lambda x: x["relevance"], reverse=True)[:n_subtrends]

def get_topic_visualization_data(corpus: List[Dict]) -> Dict[str, Any]:
    topics = cluster_terms_into_topics(corpus, n_topics=5)
    viz_data = {"topics": [], "subtrends": {}}
    for topic in topics:
        topic_name = topic["name"]
        viz_data["topics"].append({"name": topic_name, "count": topic["count"],
            "sample_articles": topic["sample_articles"], "key_terms": topic["terms"][:5]})
        subtrends = extract_subtrends(corpus, topic, n_subtrends=5)
        viz_data["subtrends"][topic_name] = subtrends
    return viz_data

Overwriting search_articles.py
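
The topic clustering above is seeded by counting how often pairs of terms appear in the same article. A toy sketch of that co-occurrence count follows, using made-up term sets rather than real corpus data. The function name `cooccurrence_counts` is illustrative.

```python
import itertools
from collections import defaultdict

def cooccurrence_counts(doc_term_sets):
    """Count, for each pair of terms, how many documents contain both."""
    counts = defaultdict(lambda: defaultdict(int))
    for terms in doc_term_sets:
        for t1, t2 in itertools.combinations(sorted(terms), 2):
            counts[t1][t2] += 1
            counts[t2][t1] += 1
    return counts

# Each set holds the distinct terms of one (invented) article.
toy_docs = [
    {"openai", "chatgpt", "model"},
    {"openai", "chatgpt", "funding"},
    {"iphone", "apple"},
]
counts = cooccurrence_counts(toy_docs)

# Terms most often seen alongside "openai", strongest first.
related = sorted(counts["openai"].items(), key=lambda kv: (-kv[1], kv[0]))
print(related)
```

A seed term's strongest co-occurring neighbours become the other members of its topic, which is why the clustering requires a co-occurrence count of at least 2.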


In [None]:
# ===CELL 6: Create llm_interface.py===
#@title 6. Create llm_interface.py
%%writefile llm_interface.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

class HF_LLM:
    def __init__(self, model_name=MODEL_NAME, max_new_tokens=256):
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens

        if torch.cuda.is_available():
            print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
            self.device = "cuda"
            self.dtype = torch.float16
        else:
            print("⚠️ No GPU, using CPU")
            self.device = "cpu"
            self.dtype = torch.float32

        print(f"Loading {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map="auto" if self.device == "cuda" else None,
            torch_dtype=self.dtype, trust_remote_code=True
        )
        self.gen_cfg = GenerationConfig(max_new_tokens=max_new_tokens, temperature=0.7, do_sample=True)
        print(f"✅ Model loaded on {next(self.model.parameters()).device}")


Overwriting llm_interface.py
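
Because the wrapper loads an instruction-tuned checkpoint, prompts are commonly passed through the tokenizer's chat template rather than as raw text. The following is a hedged sketch of building the message list. `build_messages` and the system prompt wording are hypothetical, and the commented call assumes a Hugging Face tokenizer that ships a chat template. The app in this notebook sends a plain-text prompt instead.

```python
def build_messages(question, snippets):
    """Assemble chat-style messages from a question and article snippets."""
    context = "\n".join(f"- {s}" for s in snippets)
    return [
        {"role": "system", "content": "You answer questions using the supplied news snippets."},
        {"role": "user", "content": f"Articles:\n{context}\n\nQuestion: {question}"},
    ]

messages = build_messages("What is new in AI?", ["Lab ships a new model", "Chip demand rises"])
print(messages[1]["content"])

# With a loaded tokenizer (not run here), the messages could be rendered as:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```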


In [None]:
#@title 7. Create app.py
%%writefile app.py
import streamlit as st
import re, torch, os
import pandas as pd
from datetime import datetime, timedelta

from fetch_news import fetch_news_articles, get_date_range_options
from preprocess import preprocess
from search_articles import load_processed, search_corpus, get_topic_visualization_data
from llm_interface import HF_LLM

st.set_page_config(page_title="Tech Trends Agent", layout="wide")

NEWSAPI_KEY = "821aeebd32f34b238e99e1282b9eb935"

@st.cache_resource
def load_llm():
    return HF_LLM()


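@st.cache_data
def get_viz_data(_corpus):
    # The leading underscore tells Streamlit not to hash this argument;
    # the cached result is reused until st.cache_data.clear() is called.
    return get_topic_visualization_data(_corpus)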

def generate_answer(llm, question, results):
    context = ""
    for r in results[:5]:
        context += f"- {r['title']}: {r['snippet'][:200]}\n\n"
    prompt = f"""Based on these recent tech news articles:

{context}

Write a 3-4 sentence paragraph answering: {question}

Be specific and mention actual details from the articles. Write naturally.

Answer:"""
    inputs = llm.tokenizer(prompt, return_tensors="pt").to(llm.model.device)
    with torch.no_grad():
        output = llm.model.generate(
            **inputs, max_new_tokens=200, temperature=0.7,
            do_sample=True, pad_token_id=llm.tokenizer.eos_token_id
        )
    answer = llm.tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()
    answer = answer.split('\n\n')[0].strip()
    return re.sub(r'^Answer:\s*', '', answer)

def format_date(date_str):
    if not date_str: return ""
    try:
        dt = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
        return dt.strftime("%b %d, %Y")
    except (ValueError, TypeError, AttributeError): return ""

def process_question(question, llm, corpus):
    results = search_corpus(question, corpus, k=5)
    answer = generate_answer(llm, question, results)
    return answer, results

if 'chat_history' not in st.session_state:
    st.session_state.chat_history = []
if 'corpus' not in st.session_state:
    st.session_state.corpus = None
if 'data_loaded' not in st.session_state:
    st.session_state.data_loaded = False

with st.sidebar:
    st.header("Data Source")
    st.success("API Key configured")
    st.divider()
    st.subheader("Date Range")
    date_options = get_date_range_options()
    range_choice = st.selectbox("Quick Select", ["Custom"] + list(date_options.keys()))
    if range_choice == "Custom":
        col1, col2 = st.columns(2)
        with col1:
            from_date = st.date_input("From", datetime.now() - timedelta(days=7))
        with col2:
            to_date = st.date_input("To", datetime.now())
    else:
        from_date, to_date = date_options[range_choice]
        from_date = from_date.date() if hasattr(from_date, 'date') else from_date
        to_date = to_date.date() if hasattr(to_date, 'date') else to_date
        st.info(f"{from_date} to {to_date}")
    max_articles = st.slider("Max Articles", 10, 100, 100, step=10)
    st.divider()
    if st.button("Fetch Articles", type="primary", use_container_width=True):
        with st.spinner("Fetching articles..."):
            try:
                articles = fetch_news_articles(api_key=NEWSAPI_KEY, from_date=str(from_date), to_date=str(to_date), max_articles=max_articles)
                if articles:
                    with st.spinner("Processing..."):
                        st.session_state.corpus = preprocess(articles)
                        st.session_state.data_loaded = True
                        st.session_state.chat_history = []
                        st.cache_data.clear()
                    st.success(f"Loaded {len(st.session_state.corpus)} articles!")
                else:
                    st.warning("No articles found")
            except Exception as e:
                st.error(f"Error: {str(e)}")
    if not st.session_state.data_loaded:
        try:
            st.session_state.corpus = load_processed()
            st.session_state.data_loaded = True
            st.info(f"Loaded {len(st.session_state.corpus)} cached articles")
        except (OSError, ValueError):
            # No readable cache yet; the user can fetch articles instead.
            pass
    st.divider()
    if st.button("Clear Chat", use_container_width=True):
        st.session_state.chat_history = []
        st.rerun()
    if st.button("Re-analyze Topics", use_container_width=True):
        st.cache_data.clear()
        st.rerun()

st.title("Tech Trends News Agent")

if not st.session_state.data_loaded or st.session_state.corpus is None:
    st.warning("No articles loaded. Please fetch articles from the sidebar.")
else:
    llm = load_llm()
    corpus = st.session_state.corpus
    tab1, tab2 = st.tabs(["Chat", "Topic Visualization"])

    with tab1:
        st.caption(f"Ask questions about {len(corpus)} recent tech news articles")

        # Chat container with fixed height for scrolling
        chat_container = st.container(height=500)

        with chat_container:
            if not st.session_state.chat_history:
                st.markdown("### Try asking:")
                col1, col2 = st.columns(2)
                suggestions = ["What are the latest AI trends?", "Tell me about tech news", "What are the biggest tech stories?", "Any news about smartphones?"]
                for i, sug in enumerate(suggestions):
                    with (col1 if i % 2 == 0 else col2):
                        if st.button(sug, key=f"sug_{i}", use_container_width=True):
                            st.session_state.pending_question = sug
                            st.rerun()
            else:
                for msg in st.session_state.chat_history:
                    with st.chat_message(msg["role"]):
                        st.write(msg["content"])
                        if msg["role"] == "assistant" and "sources" in msg:
                            with st.expander("View Sources"):
                                for s in msg["sources"]:
                                    st.markdown(f"**[{s['title']}]({s.get('url', '#')})**")
                                    st.caption(f"{s.get('source', 'Unknown')} | {format_date(s.get('published', ''))} | Score: {s['score']:.1%}")

        # Process pending question from button click
        if 'pending_question' in st.session_state:
            question = st.session_state.pending_question
            del st.session_state.pending_question
            st.session_state.chat_history.append({"role": "user", "content": question})
            with st.spinner("Searching..."):
                answer, results = process_question(question, llm, corpus)
            st.session_state.chat_history.append({"role": "assistant", "content": answer, "sources": results})
            st.rerun()

        # Chat input always at bottom
        if prompt := st.chat_input("Ask about technology trends..."):
            st.session_state.chat_history.append({"role": "user", "content": prompt})
            with st.spinner("Searching..."):
                answer, results = process_question(prompt, llm, corpus)
            st.session_state.chat_history.append({"role": "assistant", "content": answer, "sources": results})
            st.rerun()

    with tab2:
        st.header("Discovered Topics & Sub-Trends")
        viz_data = get_viz_data(corpus)
        if not viz_data["topics"]:
            st.warning("No topics found.")
        else:
            st.subheader("Top 5 Topics")
            topic_df = pd.DataFrame({"Topic": [t["name"] for t in viz_data["topics"]], "Articles": [t["count"] for t in viz_data["topics"]]})
            st.bar_chart(topic_df.set_index("Topic"))
            st.subheader("Topic Deep Dive")
            for i, topic_info in enumerate(viz_data["topics"]):
                topic_name = topic_info["name"]
                subtrends = viz_data["subtrends"].get(topic_name, [])
                with st.expander(f"{i+1}. {topic_name} ({topic_info['count']} articles)", expanded=(i==0)):
                    if topic_info.get("key_terms"):
                        st.markdown("**Key Terms:** " + " ".join([f"`{t}`" for t in topic_info["key_terms"]]))
                    col1, col2 = st.columns(2)
                    with col1:
                        st.markdown("**Sample Articles:**")
                        for j, title in enumerate(topic_info.get("sample_articles", [])[:3], 1):
                            st.markdown(f"{j}. {title[:60]}...")
                    with col2:
                        st.markdown("**Sub-Trends:**")
                        for st_info in subtrends[:5]:
                            st.markdown(f"**{st_info['name']}**")
                            st.progress(st_info["relevance"])
            st.divider()
            col1, col2, col3 = st.columns(3)
            total_cat = sum(t["count"] for t in viz_data["topics"])
            col1.metric("Total Articles", len(corpus))
            col2.metric("Categorized", total_cat)
            col3.metric("Coverage", f"{total_cat/len(corpus)*100:.0f}%")

Overwriting app.py
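
`generate_answer` trims the raw model output in two steps: it keeps only the first paragraph and removes a leading `Answer:` label. The same post-processing is shown here in isolation on a made-up string. The `tidy_answer` name is hypothetical.

```python
import re

def tidy_answer(raw):
    """Keep the first paragraph and drop a leading 'Answer:' label."""
    first_paragraph = raw.strip().split("\n\n")[0].strip()
    return re.sub(r"^Answer:\s*", "", first_paragraph)

raw_output = "Answer: AI chips dominated the week.\n\nExtra rambling the app discards."
print(tidy_answer(raw_output))
```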


In [None]:
# ===CELL 8: Launch the App===
#@title 8. Launch Streamlit App
!pip install -q pyngrok
!pkill -f streamlit

from pyngrok import ngrok


ngrok.set_auth_token("36aLZAaAh8rvnvKvJEyLtC6HM3Y_6hwzr4BhX2FZVwVLSPgC8")

public_url = ngrok.connect(8501)
print("="*60)
print(f"🌐 OPEN THIS URL: {public_url}")
print("="*60)
print("\n⏳ Wait ~60 seconds for model to load, then use the app!")

!streamlit run app.py --server.port 8501 --server.headless true &

🌐 OPEN THIS URL: NgrokTunnel: "https://subaqueous-unusably-oralee.ngrok-free.dev" -> "http://localhost:8501"

⏳ Wait ~60 seconds for model to load, then use the app!

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.83.185.180:8501

Fetching articles from 2025-12-02 to 2025-12-09...
  Page 1: found 98 articles
  Total available: 5649
Error fetching page 2: {'status': 'error', 'code': 'maximumResultsReached', 'message': 'You have requested too many results. Developer accounts are limited to a max of 100 results. You are trying to request results 100 to 200. Please upgrade to a paid plan if you need more results.'}
[OK] Saved 98 articles to data/raw/newsapi_20251209_010140.json
Processing 98 articles...
[OK] Saved 98 articles to data/processed/corpus_20251209_010143.json
✅ GPU: Tesla T4
Loading Qwen/Qwen2.5-1.5B-Instruct...
`torch_dtype` is deprecated! Use `dtype` instead!
2025-12-09 01:01:44.751113: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable