# Model Logic (Return-Based Direction Classification Using Logistic Regression)

This project predicts **whether NVIDIA’s closing price will go *UP* or *DOWN* on day *t*** using only information available at the end of day *(t–1)*.  
This ensures the model never sees future data and respects real-world trading constraints.

---

## 1. What the Model Uses

For each day *(t–1)*, we use:

- **TF-IDF features of all news headlines from day (t–1)**
- **Return(t–1)**, defined as:

$$
\text{return}(t-1) = \frac{\text{close}(t-1) - \text{close}(t-2)}{\text{close}(t-2)}
$$

These represent all market + news information available before predicting day *t*.
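As a minimal sketch of the return definition above, the snippet below computes daily returns with pandas. The four closing prices are hypothetical toy values, not NVIDIA data, and `daily_return` is an illustrative name.

```python
import pandas as pd

# Hypothetical closing prices for four consecutive trading days (days 1..4).
close = pd.Series([100.0, 102.0, 101.0, 103.0], index=[1, 2, 3, 4], name="close")

# return(d) = (close(d) - close(d-1)) / close(d-1); undefined (NaN) on the first day.
daily_return = close.pct_change()

print(daily_return.round(4))
```

The first day has no return because there is no earlier close to compare against, which is one reason the aligned dataset is shorter than the raw series.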

---

## 2. What the Model Predicts

We predict the **direction** of the next day's price movement:

$$
\text{Movement}(t) =
\begin{cases}
1 & \text{if return}(t) > 0 \\
0 & \text{otherwise}
\end{cases}
$$

This makes the task a **binary classification**:

- **1 → stock goes up**  
- **0 → stock goes down or stays flat**
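The labeling rule can be sketched in a few lines. The return values here are hypothetical and include one flat day to show that it maps to class 0.

```python
import numpy as np

# Hypothetical daily returns; the 0.0 entry is a flat day.
returns = np.array([0.02, -0.0098, 0.0, 0.0198])

# Movement = 1 only for a strictly positive return; flat and down days both map to 0.
movement = (returns > 0).astype(int)

print(movement.tolist())  # [1, 0, 0, 1]
```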

---

## 3. Time-Series Alignment

Because a return describes the *change* between two consecutive days, features and labels are offset by one day. Each row pairs the features of day *(t–1)* with the label of day *t*:

| Feature Day (t–1) | TF-IDF Used | Return Used | Predict Label |
|------------|-------------|-------------|----------------|
| 2 | TF-IDF(2) | Return(2) | Movement(3) |
| 3 | TF-IDF(3) | Return(3) | Movement(4) |
| 4 | TF-IDF(4) | Return(4) | Movement(5) |

Each training sample is aligned as:

**Features:** TF-IDF(t–1), Return(t–1)  
**Target:** Movement(t)

With *N* days of data, day 1 has no return (there is no earlier close) and day *N* has no next-day label, so this alignment results in **N–2 valid training samples**.

---

## 4. Example of Alignment

Suppose we have:

| Day | TF-IDF | Close |
|-----|--------|--------|
| 1 | `[0.2, 0.1]` | 100 |
| 2 | `[0.5, 0.0]` | 102 |
| 3 | `[0.3, 0.2]` | 101 |
| 4 | `[0.1, 0.4]` | 103 |

Returns:

| Day | Return |
|-----|--------|
| 2 | (102−100)/100 = 0.02 |
| 3 | (101−102)/102 = −0.0098 |
| 4 | (103−101)/101 = 0.0198 |

Aligned training rows (day 1 has no return, so the first usable feature day is day 2):

| Row | TF-IDF(t–1) | Return(t–1) | Predict Movement(t) |
|-----|-------------|--------------|---------------------|
| 0 | `[0.5, 0.0]` | 0.02 | Movement(3)=0 |
| 1 | `[0.3, 0.2]` | -0.0098 | Movement(4)=1 |

Four days of data therefore give N–2 = 2 training rows.
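The alignment rule from Section 3 (features from day *t–1*, label from day *t*) can be sketched with NumPy slicing on the four-day toy data above. The variable names are illustrative and not taken from the notebook code.

```python
import numpy as np

# Toy data from the table above: one TF-IDF row and one close per day (days 1..4).
tfidf = np.array([[0.2, 0.1], [0.5, 0.0], [0.3, 0.2], [0.1, 0.4]])
close = np.array([100.0, 102.0, 101.0, 103.0])

# returns[k] is the return of day k+2 (days 2..4); day 1 has no return.
returns = np.diff(close) / close[:-1]

# A feature day d = t-1 needs its own return (d >= 2) and a following day (d <= N-1).
X_tfidf = tfidf[1:-1]                # TF-IDF(2), TF-IDF(3)
X_ret = returns[:-1]                 # Return(2), Return(3)
y = (returns[1:] > 0).astype(int)    # Movement(3), Movement(4)

for row in zip(X_tfidf.tolist(), X_ret.round(4).tolist(), y.tolist()):
    print(row)
```

All three arrays have N–2 = 2 rows, and no row uses a price or headline from the day it is trying to predict.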

---

## 5. Modeling Pipeline

The full pipeline:

1. **Reduce TF-IDF dimensionality** using Truncated SVD (30 or 50 components).  
2. **Normalize the previous-day return** using StandardScaler.  
3. **Train a Logistic Regression classifier**, tuning penalty and C using GridSearchCV.  
4. **Perform walk-forward prediction** on each test day, always using actual return(t–1).  
5. **Sweep SVD dimensions and decision thresholds** to identify the strongest model.

This produces a fully time-aligned, feature-engineered classifier for predicting NVIDIA’s next-day price direction.
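One possible way to wire steps 1 to 3 together is a single scikit-learn `Pipeline`, so that the SVD and the scaler are refit on each training fold during the grid search. This is only a sketch on synthetic data: the column layout, the reduced `C` grid, and the use of `TimeSeriesSplit` (instead of the plain `cv=5` used in the notebook cells) are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_days, n_terms = 240, 50

# Synthetic stand-ins: columns 0..49 play the role of TF-IDF(t-1), column 50 of return(t-1).
X = np.hstack([rng.random((n_days, n_terms)), rng.normal(0, 0.02, (n_days, 1))])
y = rng.integers(0, 2, n_days)  # random labels, so no real signal is expected

# Step 1 (SVD on the text columns) and step 2 (scaling the return column).
features = ColumnTransformer([
    ("svd", TruncatedSVD(n_components=30, random_state=42), list(range(n_terms))),
    ("ret", StandardScaler(), [n_terms]),
])

# Step 3: logistic regression on the combined features.
pipe = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

# Each fold trains on earlier days and validates on later days.
grid = GridSearchCV(
    pipe,
    {"clf__C": [0.1, 1, 10]},
    scoring="f1",
    cv=TimeSeriesSplit(n_splits=4),
)
grid.fit(X, y)

prob_up = grid.predict_proba(X[-5:])[:, 1]
print(grid.best_params_, prob_up.round(3))
```

Keeping the preprocessing inside the pipeline means the validation folds never influence the SVD basis or the scaling statistics.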

# 1. Imports

In [1]:
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from scipy import sparse

from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

# 2. Load Data

In [2]:
# paths
DATA_PATH = Path("../data/")
DATA_OUTPUT_PATH = Path("../output/")

In [3]:
# load tfidf matrices and the fitted vectorizer
TFIDF_PATH = DATA_PATH / "tfidf"
X_train_text = sparse.load_npz(TFIDF_PATH / "X_train_tfidf.npz")
X_test_text  = sparse.load_npz(TFIDF_PATH / "X_test_tfidf.npz")
vectorizer   = joblib.load(TFIDF_PATH / "tfidf_vectorizer.pkl")

df_nvidia = pd.read_csv(DATA_PATH / "NVIDIA_Merged_20241101-Present.csv")
df_nvidia['date'] = pd.to_datetime(df_nvidia['date'])
df_nvidia.head()

| | language | sourcecountry | seendate | date | url | title | domain | open | high | low | close | adj_close | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | English | Australia | 2024-11-18 03:45:00+00:00 | 2024-11-18 | https://www.fool.com.au/2024/11/18/prediction-... | Prediction : Nvidia stock is going to soar aft... | fool.com.au | 139.5 | 141.55 | 137.15 | 140.15 | 140.11 | 221205300 |
| 1 | English | Cyprus | 2024-11-18 04:00:00+00:00 | 2024-11-18 | https://cyprus-mail.com/2024/11/18/softbank-fi... | SoftBank first to receive new Nvidia chips for... | cyprus-mail.com | 139.5 | 141.55 | 137.15 | 140.15 | 140.11 | 221205300 |
| 2 | English | China | 2024-11-18 04:00:00+00:00 | 2024-11-18 | https://www.morningstar.com/markets/this-unlov... | Why Small - Cap Value Stocks Look Attractive R... | morningstar.com | 139.5 | 141.55 | 137.15 | 140.15 | 140.11 | 221205300 |
| 3 | English | United States | 2024-11-18 06:30:00+00:00 | 2024-11-18 | https://247wallst.com/market-news/2024/11/17/n... | Nasdaq Futures Up Sunday Night : NVIDIA Earnin... | 247wallst.com | 139.5 | 141.55 | 137.15 | 140.15 | 140.11 | 221205300 |
| 4 | English | United States | 2024-11-18 11:00:00+00:00 | 2024-11-18 | https://www.benzinga.com/24/11/42029943/dow-tu... | Dow Tumbles Over 300 Points Following Economic... | benzinga.com | 139.5 | 141.55 | 137.15 | 140.15 | 140.11 | 221205300 |


# 3. Train/Test Split (Daily)

In [4]:
SPLIT_DATE = pd.Timestamp("2025-11-01")

train_df = df_nvidia[df_nvidia["date"] < SPLIT_DATE]
test_df  = df_nvidia[df_nvidia["date"] >= SPLIT_DATE]

train_daily = (
    train_df.groupby("date")["close"]
    .first()
    .reset_index()
    .sort_values("date")
)

test_daily = (
    test_df.groupby("date")["close"]
    .first()
    .reset_index()
    .sort_values("date")
)

train_dates = train_daily["date"].values
test_dates  = test_daily["date"].values

y_train_all = train_daily["close"].values
y_test_all  = test_daily["close"].values

print("Train days:", len(train_dates))
print("Test days :", len(test_dates))
print("TF-IDF train shape:", X_train_text.shape)
print("TF-IDF test shape :", X_test_text.shape)

Train days: 226
Test days : 15
TF-IDF train shape: (226, 50)
TF-IDF test shape : (15, 50)


# 4. Compute Returns & Time-Series Alignments

In [5]:
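# Daily returns within each split; entry k is the return of day k+1 relative to day k
train_returns = (y_train_all[1:] - y_train_all[:-1]) / y_train_all[:-1]   # length N-1
test_returns  = (y_test_all[1:] - y_test_all[:-1]) / y_test_all[:-1]      # length M-1

# Return of the first test day, measured against the last training close
first_test_return_prev = (y_test_all[0] - y_train_all[-1]) / y_train_all[-1]
test_returns = np.r_[first_test_return_prev, test_returns]  # length M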

In [6]:
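# Labels: movement(t), starting from the third training day
y_train = (train_returns[1:] > 0).astype(int)            # length N-2

# Feature: return(t-1)
prev_return_train = train_returns[:-1].reshape(-1, 1)     # length N-2

# Feature: TF-IDF. Assuming rows are in date order, [:-2] pairs each label with
# the TF-IDF row from day t-2; X_train_text[1:-1] would select TF-IDF(t-1).
X_train_tfidf_prev = X_train_text[:-2]                    # length N-2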

print("TF-IDF aligned:", X_train_tfidf_prev.shape)
print("Returns aligned:", prev_return_train.shape)
print("Labels aligned :", y_train.shape)

TF-IDF aligned: (224, 50)
Returns aligned: (224, 1)
Labels aligned : (224,)


# 5. GridSearch for Optimal Parameters

In [7]:
svd_dims = [30, 50]
thresholds = np.arange(0.4, 0.61, 0.025)

results = []

for dim in svd_dims:
    print(f"\n=== Training SVD={dim} ===")
    
    svd = TruncatedSVD(n_components=dim, random_state=42)
    X_train_svd = svd.fit_transform(X_train_tfidf_prev)
    
    scaler = StandardScaler()
    prev_return_scaled = scaler.fit_transform(prev_return_train)
    
    X_train_iter = np.hstack([X_train_svd, prev_return_scaled])
    
    # hyperparameter search
    param_grid = {
        "penalty": ["l1", "l2"],
        "C": [0.01, 0.1, 1, 5, 10],
        "solver": ["saga"],
        "max_iter": [2000],
    }
    
    logreg = LogisticRegression(class_weight="balanced")
    
    grid = GridSearchCV(
        logreg, param_grid, scoring="f1", cv=5, n_jobs=-1, verbose=0
    )
    grid.fit(X_train_iter, y_train)
    
    best_model = grid.best_estimator_
    
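    # prediction: row i of X_test_svd holds the headlines of test day i itself,
    # paired below with the previous day's return to predict movement on day i
    X_test_svd = svd.transform(X_test_text)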
    
    predicted_probs = []
    prev_ret = train_returns[-1]
    
    for i in range(len(test_dates)):
        prev_scaled = scaler.transform([[prev_ret]])
        X_i = np.hstack([X_test_svd[i].reshape(1, -1), prev_scaled])
        
        prob_up = best_model.predict_proba(X_i)[0][1]
        predicted_probs.append(prob_up)
        
        prev_ret = test_returns[i]
    
    actual = (test_returns > 0).astype(int)
    
    # evaluate thresholds
    for thr in thresholds:
        preds = (np.array(predicted_probs) >= thr).astype(int)
        
        acc = accuracy_score(actual, preds)
        prec = precision_score(actual, preds, zero_division=0)
        rec  = recall_score(actual, preds)
        f1   = f1_score(actual, preds)
        
        results.append({
            "svd_dim": dim,
            "threshold": thr,
            "accuracy": acc,
            "precision": prec,
            "recall": rec,
            "f1": f1,
            "confusion": confusion_matrix(actual, preds)
        })



=== Training SVD=30 ===

=== Training SVD=50 ===


# 6. Results

In [8]:
results_df = pd.DataFrame(results)
results_df_sorted = results_df.sort_values("f1", ascending=False)

results_df_sorted.head(10)

| | svd_dim | threshold | accuracy | precision | recall | f1 | confusion |
|---|---|---|---|---|---|---|---|
| 2 | 30 | 0.45 | 0.733333 | 0.6 | 1.0 | 0.75 | [[5, 4], [0, 6]] |
| 11 | 50 | 0.45 | 0.733333 | 0.6 | 1.0 | 0.75 | [[5, 4], [0, 6]] |
| 3 | 30 | 0.475 | 0.8 | 0.8 | 0.666667 | 0.727273 | [[8, 1], [2, 4]] |
| 12 | 50 | 0.475 | 0.8 | 0.8 | 0.666667 | 0.727273 | [[8, 1], [2, 4]] |
| 4 | 30 | 0.5 | 0.8 | 1.0 | 0.5 | 0.666667 | [[9, 0], [3, 3]] |
| 13 | 50 | 0.5 | 0.8 | 1.0 | 0.5 | 0.666667 | [[9, 0], [3, 3]] |
| 1 | 30 | 0.425 | 0.466667 | 0.428571 | 1.0 | 0.6 | [[1, 8], [0, 6]] |
| 10 | 50 | 0.425 | 0.466667 | 0.428571 | 1.0 | 0.6 | [[1, 8], [0, 6]] |
| 0 | 30 | 0.4 | 0.4 | 0.4 | 1.0 | 0.571429 | [[0, 9], [0, 6]] |
| 9 | 50 | 0.4 | 0.4 | 0.4 | 1.0 | 0.571429 | [[0, 9], [0, 6]] |


The threshold sweep singles out one F1-optimal configuration for predicting NVIDIA's next-day stock direction: 30 SVD components (50 components gave identical results) with a classification threshold of 0.45, reaching an F1 score of 0.75 on the 15 test days. This configuration yielded:
- 100% recall: all 6 upward price movements were identified
- 60% precision: 4 of the 10 predicted 'up' days were false positives
- Overall accuracy of 73.33% (11 of 15 days), above both random guessing (50%) and the 60% obtained by always predicting the majority class ('down', 9 of 15 days)

Because the threshold was selected on the same 15 days used for evaluation, these figures are optimistic rather than true out-of-sample estimates.
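The reported metrics can be re-derived from the confusion matrix alone. The short check below uses the matrix listed for threshold 0.45 and also computes the accuracy of always predicting the majority class, which is a tougher baseline than a coin flip on an imbalanced test set.

```python
import numpy as np

# Confusion matrix reported for threshold 0.45, in scikit-learn layout [[TN, FP], [FN, TP]].
cm = np.array([[5, 4], [0, 6]])
(tn, fp), (fn, tp) = cm

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Baseline that always predicts the larger class (here "down": 9 of 15 days).
majority_baseline = max(tn + fp, fn + tp) / cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} majority_baseline={majority_baseline:.3f}")
```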

Some general patterns from the results:
- Threshold tuning matters more than the SVD dimensionality of the TF-IDF vectors: the 30- and 50-component models produced identical metrics at every threshold shown. The TF-IDF matrix has only 50 features, so the 20 extra components evidently did not change any prediction, while the threshold shifts the precision-recall balance.
- At lower thresholds (0.40-0.425), the model becomes over-aggressive and predicts 'up' on 14 or 15 of the 15 days, which keeps recall at 1.0 but pushes precision down to about 0.4.
- At higher thresholds (0.475-0.50), the model becomes more conservative: precision and accuracy rise (to 0.8-1.0 and 0.80), but recall falls to 0.67-0.50, which lowers F1.
- At the F1-optimal threshold, the model is better at identifying 'up' movements than 'down' movements (4 of 9 'down' days are labeled 'up'). This may reflect market asymmetry and the positive skew of headline sentiment, although 15 test days are too few to confirm it.

# 7. Discussion

The results suggest that daily financial headlines, combined with the previous day's return, may carry some predictive signal about NVIDIA's next-day stock direction. Even a simple linear classifier such as regularized logistic regression extracted enough information from the TF-IDF values and returns to reach 73-80% directional accuracy on the test days, with recall of up to 100% depending on the threshold. With only 15 test days, this evidence is tentative.

However, the model's predictive strength is asymmetric: at the F1-optimal threshold it catches every upward day (recall 1.0) but is less reliable on downward days, labeling 4 of 9 as 'up'. This asymmetry is consistent with commonly cited properties of financial news: headlines often skew toward positive or optimistic sentiment, and downward market moves are harder to anticipate and may be driven by external shocks that daily news does not capture. The identical results for the 30-component and 50-component SVD suggest that the TF-IDF features carry a relatively low-rank signal, so increasing the embedding dimensionality does not meaningfully improve the predictive power of our model.

Some potential next steps are:
- Adding sentiment scores from VADER or FinBERT alongside TF-IDF
- Incorporating headline volume (number of articles per day)
- Adding multi-day history features, such as 3-7 day rolling averages
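As a rough sketch of how the headline-volume and rolling-history ideas could be added without look-ahead, the snippet below builds both features on a hypothetical daily frame and shifts them by one day. The column names, the toy values, and the 3-day window are illustrative assumptions.

```python
import pandas as pd

# Hypothetical per-day frame: closing price and number of headlines collected that day.
daily = pd.DataFrame(
    {
        "close": [100.0, 102.0, 101.0, 103.0, 104.0, 102.0],
        "n_headlines": [12, 30, 8, 15, 22, 9],
    },
    index=pd.date_range("2025-01-06", periods=6, freq="B"),
)

daily["return"] = daily["close"].pct_change()

# Mean of the three most recent returns, ending on the same day.
daily["return_roll3"] = daily["return"].rolling(3).mean()

# Shift by one day so the row for day t only holds information up to day t-1.
features = daily[["return", "return_roll3", "n_headlines"]].shift(1).add_suffix("_prev")
target = (daily["return"] > 0).astype(int).rename("movement")

# Rows without a full history are dropped.
dataset = features.join(target).dropna()
print(dataset)
```

Longer windows cost more warm-up rows, which matters with only a few hundred trading days.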

# XGBoost

In [9]:
# prep features
svd_dim = 30  # 30 since 50 had no change

svd = TruncatedSVD(n_components=svd_dim, random_state=42)
X_train_svd_alt = svd.fit_transform(X_train_tfidf_prev)

# scale previous-day returns
scaler_ret_alt = StandardScaler()
prev_return_scaled_alt = scaler_ret_alt.fit_transform(prev_return_train)

# final train matrix
X_train_alt = np.hstack([X_train_svd_alt, prev_return_scaled_alt])

# test features
X_test_svd_alt = svd.transform(X_test_text)

In [10]:
# model
xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    scale_pos_weight=(y_train==0).sum() / (y_train==1).sum()  # balancing
)

xgb_param_grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0]
}

xgb_grid = GridSearchCV(
    xgb,
    xgb_param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1
)

xgb_grid.fit(X_train_alt, y_train)

best_xgb = xgb_grid.best_estimator_
print("\nBest XGBoost params:", xgb_grid.best_params_)

Fitting 5 folds for each of 108 candidates, totalling 540 fits

Best XGBoost params: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 1.0}


In [11]:
# pred
xgb_probs = []
prev_ret = train_returns[-1]

for i in range(len(test_dates)):

    prev_scaled = scaler_ret_alt.transform([[prev_ret]])
    X_i = np.hstack([X_test_svd_alt[i].reshape(1, -1), prev_scaled])

    prob_up = best_xgb.predict_proba(X_i)[0][1]
    xgb_probs.append(prob_up)

    prev_ret = test_returns[i]

actual = (test_returns > 0).astype(int)

In [12]:
xgb_thresholds = np.arange(0.4, 0.71, 0.05)

xgb_results = []

for thr in xgb_thresholds:
    preds = (np.array(xgb_probs) >= thr).astype(int)

    acc = accuracy_score(actual, preds)
    prec = precision_score(actual, preds, zero_division=0)
    rec  = recall_score(actual, preds)
    f1   = f1_score(actual, preds)

    xgb_results.append({
        "threshold": thr,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
        "confusion": confusion_matrix(actual, preds)
    })

xgb_results_df = pd.DataFrame(xgb_results)
xgb_results_df.sort_values("f1", ascending=False)

| | threshold | accuracy | precision | recall | f1 | confusion |
|---|---|---|---|---|---|---|
| 0 | 0.4 | 0.533333 | 0.4 | 0.333333 | 0.363636 | [[6, 3], [4, 2]] |
| 1 | 0.45 | 0.533333 | 0.333333 | 0.166667 | 0.222222 | [[7, 2], [5, 1]] |
| 2 | 0.5 | 0.533333 | 0.333333 | 0.166667 | 0.222222 | [[7, 2], [5, 1]] |
| 3 | 0.55 | 0.533333 | 0.333333 | 0.166667 | 0.222222 | [[7, 2], [5, 1]] |
| 4 | 0.6 | 0.533333 | 0.0 | 0.0 | 0.0 | [[8, 1], [6, 0]] |
| 5 | 0.65 | 0.533333 | 0.0 | 0.0 | 0.0 | [[8, 1], [6, 0]] |
| 6 | 0.7 | 0.6 | 0.0 | 0.0 | 0.0 | [[9, 0], [6, 0]] |


# SVM w/ Linear Kernel

In [13]:
# model
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", LinearSVC(class_weight="balanced", random_state=42))
])

svm_param_grid = {
    "svm__C": [0.01, 0.1, 1, 5, 10],
    "svm__loss": ["squared_hinge"],
    "svm__max_iter": [2000, 5000]
}

svm_grid = GridSearchCV(
    svm_clf,
    svm_param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
    verbose=1
)

svm_grid.fit(X_train_alt, y_train)

best_svm = svm_grid.best_estimator_
print("\nBest SVM params:", svm_grid.best_params_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits

Best SVM params: {'svm__C': 0.01, 'svm__loss': 'squared_hinge', 'svm__max_iter': 2000}


In [14]:
# pred
predicted_scores = []
prev_ret = train_returns[-1]

for i in range(len(test_dates)):

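    X_i = np.hstack([
        X_test_svd_alt[i].reshape(1, -1),
        # raw (unscaled) previous return; note that X_train_alt holds the return
        # scaled by scaler_ret_alt, so train and test inputs differ in scale
        [[prev_ret]]
    ])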

    margin = best_svm.decision_function(X_i)[0]   # raw SVM margin
    predicted_scores.append(margin)

    prev_ret = test_returns[i]

actual = (test_returns > 0).astype(int)

In [15]:
# threshold sweep
thresholds = np.arange(-1.0, 0.1, 0.1)

svm_results = []

for thr in thresholds:
    preds = (np.array(predicted_scores) >= thr).astype(int)

    acc = accuracy_score(actual, preds)
    prec = precision_score(actual, preds, zero_division=0)
    rec  = recall_score(actual, preds)
    f1   = f1_score(actual, preds)

    svm_results.append({
        "threshold": thr,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
        "confusion": confusion_matrix(actual, preds)
    })

svm_results_df = pd.DataFrame(svm_results)
svm_results_df.sort_values("f1", ascending=False)

| | threshold | accuracy | precision | recall | f1 | confusion |
|---|---|---|---|---|---|---|
| 5 | -0.5 | 0.533333 | 0.461538 | 1.0 | 0.631579 | [[2, 7], [0, 6]] |
| 2 | -0.8 | 0.466667 | 0.428571 | 1.0 | 0.6 | [[1, 8], [0, 6]] |
| 3 | -0.7 | 0.466667 | 0.428571 | 1.0 | 0.6 | [[1, 8], [0, 6]] |
| 4 | -0.6 | 0.466667 | 0.428571 | 1.0 | 0.6 | [[1, 8], [0, 6]] |
| 8 | -0.2 | 0.733333 | 0.75 | 0.5 | 0.6 | [[8, 1], [3, 3]] |
| 0 | -1.0 | 0.4 | 0.4 | 1.0 | 0.571429 | [[0, 9], [0, 6]] |
| 1 | -0.9 | 0.4 | 0.4 | 1.0 | 0.571429 | [[0, 9], [0, 6]] |
| 7 | -0.3 | 0.666667 | 0.6 | 0.5 | 0.545455 | [[7, 2], [3, 3]] |
| 6 | -0.4 | 0.466667 | 0.4 | 0.666667 | 0.5 | [[3, 6], [2, 4]] |
| 9 | -0.1 | 0.733333 | 1.0 | 0.333333 | 0.5 | [[9, 0], [4, 2]] |


Overall, across all models evaluated, the results suggest that daily financial news headlines carry some predictive information about NVIDIA’s next-day stock direction, especially when combined with the previous day's return. Even with a relatively small dataset and simple linear methods, logistic regression extracted signal from the TF-IDF representations of headlines and beat both random guessing and the majority-class baseline (60% accuracy on these 15 test days). XGBoost and the linear SVM did not do so consistently.

Logistic regression was the strongest model overall. With a threshold of 0.45, it achieved an F1 score of 0.75, correctly identifying 100% of upward movements with 60% precision and 73.33% accuracy. The model showed that:
- Thresholds have a larger impact on performance than SVD dimensionality: 30 and 50 components gave identical metrics, suggesting that most of the useful headline information lies in a low-rank structure.
- Lower thresholds inflate recall while harming precision, whereas higher thresholds reverse this tradeoff.
- F1 peaks at a threshold of 0.45 (recall 1.0, precision 0.6), while accuracy peaks at 0.475-0.50 (0.80), where precision is higher and recall lower.

Although XGBoost is a powerful nonlinear model on larger datasets, it performed the worst here, with a best F1 score of 0.36 and accuracy never above the 60% majority-class baseline. This is plausible for a small, low-signal dataset like ours (224 training rows, 31 features), where a boosted tree ensemble can easily fit noise. XGBoost generalized poorly to the test days, producing low recall and precision, and did not extract useful nonlinear structure from the available features.

The linear SVM landed in between. After sweeping decision thresholds in margin space, the F1-best configuration (threshold -0.5) achieved an F1 score of 0.632, with 0.533 accuracy, 1.0 recall, and 0.462 precision. That accuracy is below the 60% majority-class baseline, whereas thresholds of -0.2 and -0.1 reached 0.733 accuracy at lower recall. Like logistic regression, SVM performance was highly sensitive to threshold selection, which is consistent with decision boundary placement mattering a great deal for linear models on this dataset.

Across the two linear models, one consistent pattern emerged: predicting “up” is easier than predicting “down.” At their F1-best thresholds, both recover every upward day but misclassify many downward days, while XGBoost showed the opposite tendency and missed most upward days. The asymmetry in the linear models is plausible, given that financial headlines tend to be positively skewed, markets drift upward over time, and negative movements are often triggered by external shocks not reflected in daily news text. The identical performance of the 30-component and 50-component SVD further suggests that the TF-IDF embeddings contain low-rank latent structure, and increasing representation dimensionality does not substantially improve predictive power.