# AI-Driven News Sentiment & Weekly Summarization for Stock Movement (NASDAQ)
**Created:** 2025-08-28

> **Important:** In Colab, go to **Runtime → Change runtime type → Hardware accelerator = GPU** (T4 or "GPU"). Then **Save**.
>
> **Submission:** Export this notebook to **HTML** (File → Download → Download as .ipynb, then convert to HTML) or use Colab's *Print* to PDF/HTML.

**What you'll do in this notebook**
1. Load daily news + OHLCV for a single NASDAQ-listed company.
2. EDA (univariate/bivariate) with clear plots and insights.
3. Preprocess text & create train/validation/test splits (time-aware).
4. Build 3 embedding pipelines:
   - Word2Vec (train on corpus)
   - GloVe (pretrained)
   - Sentence-Transformer (MiniLM)
5. Train a classifier for each embedding (with basic hyperparameter search).
6. Evaluate with appropriate metrics & pick the best model; test-set results.
7. Weekly news summarization using an open-source LLM from Hugging Face.
8. Actionable insights & recommendations.

> **Data format required (CSV):**
>
> - `Date` (YYYY-MM-DD)
> - `News` (string)
> - `Open`, `High`, `Low`, `Close` (float)
> - `Volume` (int)
> - `Label` in \{-1, 0, 1\} (Negative, Neutral, Positive)

---
## 0) Environment Setup
Run the cell below once in Colab. It installs the required libraries.

In [None]:
# If running on Colab, this will install/upgrade all deps.
!pip -q install --upgrade pip
!pip -q install numpy pandas scikit-learn matplotlib gensim nltk imbalanced-learn
!pip -q install sentence-transformers transformers accelerate torch --extra-index-url https://download.pytorch.org/whl/cu121 || pip -q install sentence-transformers transformers accelerate torch

---
## 1) Imports, Config, and Utilities

In [None]:
import os
import re
import gc
import json
import math
import random
import string
import warnings
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_recall_fscore_support, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

from gensim.models import Word2Vec, KeyedVectors
import gensim.downloader as api

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sentence_transformers import SentenceTransformer

warnings.filterwarnings('ignore')

# NLTK setup
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # some NLTK versions need this for WordNetLemmatizer

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)

plt.rcParams['figure.figsize'] = (7,4)

---
## 2) Data Loading
- Upload your CSV or mount Google Drive and set a path.

In [None]:
#@title 🔼 Upload your dataset (CSV) here
# In Colab, use the file uploader widget (uncomment next lines)
# from google.colab import files
# uploaded = files.upload()
# CSV_FILE = list(uploaded.keys())[0]

# OR set a path manually (e.g., from Drive)
CSV_FILE = "news_stock_sample.csv"  # Change this to your filename

# Load
df = pd.read_csv(CSV_FILE)

# Basic checks
expected_cols = {'Date','News','Open','High','Low','Close','Volume','Label'}
missing = expected_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"

# Parse dates
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values('Date').reset_index(drop=True)

print("Rows:", len(df))
df.head()

---
## 3) Problem Definition
**Goal:** Predict **news sentiment** (Label: -1, 0, 1) from text, and connect weekly summaries to price action.

**Business Value:** Better sentiment nowcasting → improved trading decisions, position sizing, and risk management.

---
## 4) Exploratory Data Analysis (EDA)
### 4.1 Univariate
- Daily price/volume distributions
- News length distribution
- Label class balance

In [None]:
# Descriptive stats for numeric columns
display(df[['Open','High','Low','Close','Volume']].describe())

# Add aux features
df['news_len'] = df['News'].astype(str).str.split().apply(len)

# Plots
df['news_len'].hist(bins=40); plt.title('News length (tokens)'); plt.xlabel('Tokens'); plt.ylabel('Count'); plt.show()

df['Label'].value_counts().sort_index().plot(kind='bar'); plt.title('Label distribution (-1, 0, 1)'); plt.xlabel('Label'); plt.ylabel('Count'); plt.show()

df.set_index('Date')['Close'].plot(); plt.title('Close over time'); plt.ylabel('Price'); plt.show()

df.set_index('Date')['Volume'].plot(); plt.title('Volume over time'); plt.ylabel('Shares'); plt.show()

### 4.2 Bivariate
- Label vs price change
- Label vs volume

In [None]:
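# Price change % from one close per date, so several news rows on the same day
# do not produce spurious 0% changes
daily_close = df.groupby('Date')['Close'].last()
df['pct_change'] = df['Date'].map(daily_close.pct_change() * 100)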

# Group by label for insights
label_grp = df.groupby('Label')['pct_change'].agg(['mean','median','count'])
display(label_grp)

# Boxplots: pct_change by label
df.boxplot(column='pct_change', by='Label'); plt.title('Pct change by Sentiment Label'); plt.suptitle(''); plt.ylabel('%'); plt.show()

# Correlations
numerics = df[['Open','High','Low','Close','Volume','news_len']].corr()
print(numerics)

> 📝 **Key EDA Observations (fill in after running):**
- Class balance & potential imbalance.
- Relationship between sentiment and next-day returns (optional to add lag).
- Any outliers or missing values?
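
The lagged relationship mentioned in the observations can be checked by attaching the *next* trading day's return to each news row. The sketch below is a minimal illustration on made-up prices; `add_next_day_return` and the `next_day_ret` column are hypothetical names, and it assumes one closing price per date.

```python
import pandas as pd

def add_next_day_return(frame, date_col="Date", close_col="Close"):
    """Attach the following trading day's % return to every row of that date."""
    daily_close = frame.groupby(date_col)[close_col].last().sort_index()
    next_ret = daily_close.pct_change().shift(-1) * 100  # return realized on the next date
    out = frame.copy()
    out["next_day_ret"] = out[date_col].map(next_ret)
    return out

# Toy data (illustrative values only): two news rows share the first date.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2025-01-02", "2025-01-02", "2025-01-03", "2025-01-06"]),
    "Close": [100.0, 100.0, 110.0, 99.0],
    "Label": [1, 0, -1, 1],
})
toy = add_next_day_return(toy)
print(toy.groupby("Label")["next_day_ret"].agg(["mean", "count"]))
```

The last date has no following day, so its return is `NaN` and drops out of the per-label averages.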

---
## 5) Preprocessing
- Clean text (lowercase, remove URLs/punct, lemmatize).
- Define X (text) and y (labels).
- Time-aware split: train (60%), valid (20%), test (20%) in chronological order.
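
The chronological 60/20/20 split can be sketched as a small helper. This is a minimal illustration that assumes the rows are already sorted by date; `chronological_split` and its parameters are hypothetical names, not part of the notebook's code.

```python
def chronological_split(items, train_frac=0.6, valid_frac=0.2):
    """Split an already date-sorted sequence into train/valid/test without shuffling."""
    n = len(items)
    train_end = int(train_frac * n)
    valid_end = train_end + int(valid_frac * n)
    return items[:train_end], items[train_end:valid_end], items[valid_end:]

days = list(range(10))  # stand-in for 10 date-ordered rows
train, valid, test = chronological_split(days)
print(len(train), len(valid), len(test))
```

Because nothing is shuffled, every validation row is later than every training row, and every test row is later than both.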

In [None]:
lemm = WordNetLemmatizer()
stops = set(stopwords.words('english'))

def clean_text(s:str) -> str:
    s = str(s).lower()
    s = re.sub(r"http\S+|www\S+", " ", s)
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    tokens = [lemm.lemmatize(t) for t in s.split() if t not in stops and len(t) > 2]
    return " ".join(tokens)

df['clean'] = df['News'].astype(str).apply(clean_text)

X = df['clean'].values
y = df['Label'].values

# Chronological split
n = len(df)
train_end = int(0.6*n)
valid_end = int(0.8*n)

X_train, y_train = X[:train_end], y[:train_end]
X_valid, y_valid = X[train_end:valid_end], y[train_end:valid_end]
X_test,  y_test  = X[valid_end:], y[valid_end:]

print(len(X_train), len(X_valid), len(X_test))

---
## 6) Word Embeddings
We'll create three embedding strategies and related vectorizers:

1. **Word2Vec (train on our corpus)** → average word vectors per document  
2. **GloVe (pretrained `glove-wiki-gigaword-100`)** → average vectors  
3. **Sentence-Transformer (`all-MiniLM-L6-v2`)** → direct sentence embeddings
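
Strategies 1 and 2 both reduce a document to the mean of its word vectors. The sketch below shows that pooling step on a toy vocabulary; the 2-dimensional vectors and the `mean_pool` helper are made up for illustration, and out-of-vocabulary tokens are simply skipped.

```python
import numpy as np

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Toy 2-d "embeddings" (illustrative values only).
toy_vectors = {
    "profit": np.array([1.0, 0.0]),
    "loss": np.array([-1.0, 0.0]),
    "record": np.array([0.0, 2.0]),
}
doc_vec = mean_pool("record profit surge".split(), toy_vectors, dim=2)  # "surge" is out of vocabulary
print(doc_vec)
```

Averaging discards word order, which is one reason sentence-level encoders can behave differently on the same text.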

In [None]:
# 6.1 Word2Vec — train on our corpus
tokenized = [s.split() for s in X_train]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4, seed=42, epochs=20)
w2v_kv = w2v.wv

def doc_embed_w2v(text, kv=w2v_kv, dim=100):
    toks = text.split()
    vecs = [kv[w] for w in toks if w in kv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def embed_corpus_w2v(corpus):
    return np.vstack([doc_embed_w2v(s) for s in corpus])

Xtr_w2v = embed_corpus_w2v(X_train)
Xva_w2v = embed_corpus_w2v(X_valid)
Xte_w2v = embed_corpus_w2v(X_test)
Xtr_w2v.shape, Xva_w2v.shape, Xte_w2v.shape

In [None]:
# 6.2 GloVe — pretrained 100d
glove = api.load('glove-wiki-gigaword-100')  # downloads once in Colab
def doc_embed_glove(text, kv=glove, dim=100):
    toks = text.split()
    vecs = [kv[w] for w in toks if w in kv]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def embed_corpus_glove(corpus):
    return np.vstack([doc_embed_glove(s) for s in corpus])

Xtr_glove = embed_corpus_glove(X_train)
Xva_glove = embed_corpus_glove(X_valid)
Xte_glove = embed_corpus_glove(X_test)
Xtr_glove.shape, Xva_glove.shape, Xte_glove.shape

In [None]:
# 6.3 Sentence-Transformer — MiniLM
from sentence_transformers import SentenceTransformer
st_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
Xtr_st = st_model.encode(list(X_train), batch_size=64, show_progress_bar=True, convert_to_numpy=True)
Xva_st = st_model.encode(list(X_valid), batch_size=64, show_progress_bar=True, convert_to_numpy=True)
Xte_st = st_model.encode(list(X_test),  batch_size=64, show_progress_bar=True, convert_to_numpy=True)
Xtr_st.shape, Xva_st.shape, Xte_st.shape

---
## 7) Modeling & Hyperparameter Tuning
We'll test two lightweight classifiers:
- **Logistic Regression** (multinomial)
- **Linear SVM (LinearSVC)**

> **Primary metric:** **Macro F1** (balances performance across {-1,0,1} even if classes are imbalanced). We'll also report Accuracy and per-class metrics.
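
To see why Macro F1 is preferred over Accuracy here, consider a classifier that always predicts Neutral on an imbalanced sample. The sketch below computes both metrics by hand on made-up labels; `macro_f1` is an illustrative helper, not the scikit-learn implementation.

```python
def macro_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Unweighted mean of per-class F1; a class with no correct predictions scores 0."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0] * 8 + [1, -1]   # 8 Neutral, 1 Positive, 1 Negative
y_pred = [0] * 10            # always predict Neutral
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, macro_f1={macro_f1(y_true, y_pred):.2f}")
# accuracy=0.80, macro_f1=0.30
```

Accuracy looks respectable because the majority class dominates, while Macro F1 exposes that two of the three classes are never recovered.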

In [None]:
from sklearn.metrics import confusion_matrix

def evaluate_and_print(y_true, y_pred, title=""):
    print(f"\n=== {title} ===")
    print("Accuracy:", round(accuracy_score(y_true, y_pred), 4))
    print("Macro F1:", round(f1_score(y_true, y_pred, average='macro'), 4))
    print("\nClassification Report:\n", classification_report(y_true, y_pred, digits=4))
    cm = confusion_matrix(y_true, y_pred, labels=[-1,0,1])
    print("\nConfusion Matrix (rows=true, cols=pred, labels=[-1,0,1]):\n", cm)

# Hyperparameter spaces
logreg_params = {
    'C': np.logspace(-2, 2, 10),
    'penalty': ['l2'],
    'solver': ['lbfgs'],
    'max_iter': [500, 1000]
}

svm_params = {
    'C': np.logspace(-2, 2, 10),
    'loss': ['hinge', 'squared_hinge']
}

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV

def tune_and_eval(Xtr, ytr, Xva, yva, space, model_name="logreg"):
    # Pick the base estimator; the search settings are identical for both models.
    # multi_class is omitted: it is deprecated in recent scikit-learn, and lbfgs
    # already fits a multinomial model for multiclass targets.
    if model_name == "logreg":
        base = LogisticRegression(random_state=42)
    else:
        base = LinearSVC(random_state=42)
    # Note: cv=3 uses stratified folds that do not respect time order inside the
    # training window; the held-out validation set below remains strictly later.
    search = RandomizedSearchCV(base, space, n_iter=12, scoring='f1_macro',
                                cv=3, random_state=42, n_jobs=-1, verbose=0)
    search.fit(Xtr, ytr)
    best = search.best_estimator_
    yhat = best.predict(Xva)
    evaluate_and_print(yva, yhat, title=f"{model_name} (best on validation)")
    return best

results = {}

# Word2Vec
print("Tuning on Word2Vec embeddings...")
best_lr_w2v = tune_and_eval(Xtr_w2v, y_train, Xva_w2v, y_valid, logreg_params, "logreg")
best_svm_w2v = tune_and_eval(Xtr_w2v, y_train, Xva_w2v, y_valid, svm_params, "svm")
results['w2v_lr'] = best_lr_w2v
results['w2v_svm'] = best_svm_w2v

# GloVe
print("\nTuning on GloVe embeddings...")
best_lr_glove = tune_and_eval(Xtr_glove, y_train, Xva_glove, y_valid, logreg_params, "logreg")
best_svm_glove = tune_and_eval(Xtr_glove, y_train, Xva_glove, y_valid, svm_params, "svm")
results['glove_lr'] = best_lr_glove
results['glove_svm'] = best_svm_glove

# Sentence-Transformer
print("\nTuning on Sentence-Transformer embeddings...")
best_lr_st = tune_and_eval(Xtr_st, y_train, Xva_st, y_valid, logreg_params, "logreg")
best_svm_st = tune_and_eval(Xtr_st, y_train, Xva_st, y_valid, svm_params, "svm")
results['st_lr'] = best_lr_st
results['st_svm'] = best_svm_st

### 7.1 Model Selection
Pick the **best validation Macro F1** model and evaluate it on the **test set**.

In [None]:
from sklearn.metrics import f1_score

def macro_f1_on_valid(clf, Xva, yva):
    yhat = clf.predict(Xva)
    return f1_score(yva, yhat, average='macro')

candidates = {
    'w2v_lr': (results['w2v_lr'], Xva_w2v),
    'w2v_svm': (results['w2v_svm'], Xva_w2v),
    'glove_lr': (results['glove_lr'], Xva_glove),
    'glove_svm': (results['glove_svm'], Xva_glove),
    'st_lr': (results['st_lr'], Xva_st),
    'st_svm': (results['st_svm'], Xva_st),
}

best_name, best_clf, best_feats = None, None, None
best_score = -1.0
for name, (clf, Xva_) in candidates.items():
    score = macro_f1_on_valid(clf, Xva_, y_valid)
    print(f"{name}: val Macro F1 = {score:.4f}")
    if score > best_score:
        best_score = score
        best_name, best_clf, best_feats = name, clf, Xva_
print(f"\nBest model on validation: {best_name} (Macro F1={best_score:.4f})")

# Test-set features based on winner
if best_name.startswith('w2v'):
    Xte = Xte_w2v
elif best_name.startswith('glove'):
    Xte = Xte_glove
else:
    Xte = Xte_st

yhat_test = best_clf.predict(Xte)
evaluate_and_print(y_test, yhat_test, title=f"FINAL TEST — {best_name}")

---
## 8) Weekly News Summarization (LLM)
**Task:** For each week, identify **Top 3 Positive** and **Top 3 Negative** events likely to impact price.

We will:
1. Group by ISO week (`Date`).
2. Concatenate that week's news text.
3. Use an instruction-tuned model (`google/flan-t5-base`) to produce structured bullets.
4. Parse into a tidy DataFrame.
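
Steps 1 and 2 can be illustrated with the standard library alone. The sketch below groups made-up headlines by ISO week (Monday through Sunday) and joins each week's text; `group_news_by_iso_week` is a hypothetical helper used only for illustration.

```python
from collections import defaultdict
from datetime import date

def group_news_by_iso_week(rows):
    """rows: iterable of (date, headline). Returns {(iso_year, iso_week): joined text}."""
    buckets = defaultdict(list)
    for day, headline in rows:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        buckets[(iso[0], iso[1])].append(headline)
    return {week: "\n".join(items) for week, items in sorted(buckets.items())}

# Made-up headlines for illustration.
rows = [
    (date(2025, 8, 25), "Chipmaker beats earnings estimates"),  # Monday
    (date(2025, 8, 31), "Regulator opens antitrust probe"),     # Sunday, same ISO week
    (date(2025, 9, 1), "New product line announced"),           # Monday, next ISO week
]
weekly_text = group_news_by_iso_week(rows)
for week, text in weekly_text.items():
    print(week, "->", text.replace("\n", " | "))
```

Each joined string is what would be passed to the summarization model for that week.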

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

def load_llm(model_name='google/flan-t5-base', device=None):
    if device is None:
        device = 0 if (hasattr(torch, 'cuda') and torch.cuda.is_available()) else -1
    tok = AutoTokenizer.from_pretrained(model_name)
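    # Load without device_map: the pipeline's device argument handles placement,
    # and combining it with an accelerate-dispatched model conflicts.
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_name)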
    gen = pipeline('text2text-generation', model=mdl, tokenizer=tok, device=device)
    return gen

def chunk_text(text, max_words=400):
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i+max_words])

def summarize_week(gen, text):
    partial_summaries = []
    for chunk in chunk_text(text, max_words=350):
        prompt = "Summarize the following financial news in 3-5 concise bullet points focusing on stock price drivers:\n\n" + chunk
        # Greedy decoding (temperature has no effect without sampling); truncate over-long inputs
        out = gen(prompt, max_new_tokens=180, do_sample=False, truncation=True)[0]['generated_text']
        partial_summaries.append(out)

    combined = "\n".join(partial_summaries)
    instruction = (
        "From the summary below, list exactly 3 Positive and 3 Negative events that could move the stock. "
        "Return in this strict template:\n"
        "POS:\n1) ...\n2) ...\n3) ...\nNEG:\n1) ...\n2) ...\n3) ...\n\n"
        "SUMMARY:\n" + combined
    )
    # Greedy decoding (temperature has no effect without sampling); truncate over-long inputs
    final = gen(instruction, max_new_tokens=220, do_sample=False, truncation=True)[0]['generated_text']
    return final

def parse_structured_events(text):
    pos, neg = [], []
    pos_block = re.search(r"POS\s*:\s*(.*?)\n\s*NEG\s*:", text, flags=re.S|re.I)
    neg_block = re.search(r"NEG\s*:\s*(.*)", text, flags=re.S|re.I)
    if pos_block:
        lines = re.findall(r"\d+\)\s*(.+)", pos_block.group(1))
        pos = [l.strip(" -•") for l in lines][:3]
    if neg_block:
        lines = re.findall(r"\d+\)\s*(.+)", neg_block.group(1))
        neg = [l.strip(" -•") for l in lines][:3]
    while len(pos) < 3: pos.append("")
    while len(neg) < 3: neg.append("")
    return pos, neg

def weekly_summaries(df):
    gen = load_llm()
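    weekly = (df
              .set_index('Date')
              # ISO weeks run Monday to Sunday, so bins end on Sunday
              .groupby(pd.Grouper(freq='W-SUN', label='right'))
              .agg({'News': lambda x: "\n".join(x.astype(str)),
                    'Close': 'last', 'Volume': 'sum'}))
    # Weeks without news yield an empty string rather than NaN, so filter on content
    weekly = weekly[weekly['News'].fillna("").str.strip() != ""]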
    records = []
    for week_end, row in weekly.iterrows():
        text = row['News']
        result = summarize_week(gen, text)
        pos, neg = parse_structured_events(result)
        records.append({
            'week_end': week_end.date().isoformat(),
            'pos_1': pos[0], 'pos_2': pos[1], 'pos_3': pos[2],
            'neg_1': neg[0], 'neg_2': neg[1], 'neg_3': neg[2],
            'close_last': row['Close'], 'volume_sum': row['Volume']
        })
    return pd.DataFrame(records)

weekly_df = weekly_summaries(df)
weekly_df.head()

---
## 9) Actionable Insights & Recommendations
Fill in after observing your results.

**Insights (examples to refine):**
- **Sentence-Transformer** embeddings typically outperform averaged word vectors on short texts; if your validation **Macro F1** is highest for ST, prefer that.
- Neutral class is often hardest; consider thresholding approaches or label smoothing.
- Weekly summaries can highlight catalysts (earnings, product launches, regulatory news) — align with spikes in volume/returns.
- Time-aware split avoids leakage; for production, use rolling retrains and walk-forward validation.
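
Walk-forward validation can be sketched as a series of expanding training windows, each followed by a later test block. This is a minimal illustration; `walk_forward_splits` is a hypothetical helper, and scikit-learn's `TimeSeriesSplit` offers comparable behavior.

```python
def walk_forward_splits(n_samples, n_folds=3):
    """Yield (train_indices, test_indices) with an expanding training window."""
    test_size = n_samples // (n_folds + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(n_folds):
        train_end = n_samples - (n_folds - k) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))

folds = list(walk_forward_splits(10, n_folds=3))
for train_idx, test_idx in folds:
    print(f"train=0..{train_idx[-1]}  test={test_idx}")
```

Every test block lies strictly after its training window, so no future rows inform the fit.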

**Recommendations:**
1. **Productionize the best pipeline** (embedding + classifier) behind an inference API; track latency and drift.
2. **Add event features** (earnings dates, macro releases) to improve price movement models.
3. **Use weekly LLM summaries** in analyst notes and as features (e.g., sentiment counts) for forecasting.
4. **Monitor class balance** over time; retrain with recent data and consider focal loss or class weights.
5. **Backtest** trading rules using the sentiment signal (e.g., long when 7-day sentiment z-score > threshold).
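
One possible reading of the z-score rule in recommendation 5 is to standardize each day's aggregate sentiment against the preceding 7 days. The sketch below follows that assumption on made-up scores; `rolling_zscore`, the window of 7, and the threshold of 1.0 are illustrative choices, not values from this notebook.

```python
from statistics import mean, pstdev

def rolling_zscore(values, window=7):
    """z-score of each value against the `window` values that precede it (None until enough history)."""
    z = []
    for i, v in enumerate(values):
        hist = values[max(0, i - window):i]
        if len(hist) < window:
            z.append(None)
            continue
        sd = pstdev(hist)
        z.append(0.0 if sd == 0 else (v - mean(hist)) / sd)
    return z

daily_sentiment = [1, -1, 1, -1, 1, -1, 0, 1]  # made-up daily scores
z = rolling_zscore(daily_sentiment)
long_signal = [s is not None and s > 1.0 for s in z]
print(z[-1], long_signal[-1])
```

Only past values enter each z-score, which keeps the signal usable in a backtest without look-ahead.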

---
## 10) Save Artifacts
Save weekly summaries and export model if needed.

In [None]:
# Save weekly summaries
weekly_path = "weekly_summaries.csv"
df_weekly_out = weekly_df.copy()
df_weekly_out.to_csv(weekly_path, index=False)
print("Saved:", weekly_path)

# (Optional) Save the final classifier; the matching embedding model (e.g., `w2v`)
# must be saved or reloaded separately to reproduce predictions
import joblib
joblib.dump(best_clf, f"best_model_{best_name}.joblib")
print("Saved:", f"best_model_{best_name}.joblib")

---
## 11) Appendix: Tips & Extensions
- Add **next-day** or **multi-day forward returns** labels to connect sentiment → price.
- Try **class weights** or **CalibratedClassifierCV** for better neutrality handling.
- Replace Flan-T5 with **BART** summarization (`facebook/bart-large-cnn`) if preferred.
- Use **wandb** or **mlflow** for experiment tracking.
- Convert this notebook to **HTML** for submission.