# Short‑Term Trading Forecasts: ML + (Optional) DL

This notebook builds a reproducible pipeline for short‑term return forecasting, using methods chosen to avoid look‑ahead bias:

1. **Data**: Load from CSV **or** download with `yfinance`.
2. **Features**: Returns, volatility, lag features, and common technical indicators (SMA/EMA, RSI, MACD, Bollinger Bands, Stochastic Oscillator, ATR, OBV).
3. **Validation**: **Walk‑Forward** cross‑validation (time‑series safe splits).
4. **Models** (choose any subset):
   - Baselines: Naive (last value), Simple Moving Average
   - **HistGradientBoostingRegressor** (fast, strong tabular baseline)
   - **RandomForestRegressor**
   - **Optional Deep Learning**: LSTM (Keras / TensorFlow). If TF isn't installed, skip those cells.
5. **Ensembling**: Simple blending of top models.
6. **Metrics**: RMSE, MAE, MAPE, **Directional Accuracy**, and a **toy backtest** (long if the predicted next‑day return exceeds a threshold).

**Targets**: Predict the next‑day **log return**, which is typically much closer to stationary than the price level. You can switch to predicting price directly if you prefer.
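
As a small illustration of the log‑return target, the sketch below computes log returns from a hypothetical price series (the numbers are made up for the example) and checks two properties that make them convenient: they add up across days, and `np.expm1` converts them back to simple returns.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
close = pd.Series([100.0, 102.0, 101.0, 105.0])

log_ret = np.log(close).diff()      # r_t = ln(P_t / P_{t-1}); the first value is NaN
simple_ret = close.pct_change()     # P_t / P_{t-1} - 1

# Log returns are additive: their sum equals the log of the total price ratio
total_log_ret = log_ret.sum()       # NaN is skipped by default

# expm1 maps a log return back to the corresponding simple return
recovered = np.expm1(log_ret)
```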

**Tip**: Start with the ML models; add the LSTM once you have verified that your environment has TensorFlow (`pip install tensorflow`).

## 0) Setup

In [None]:
# To install the optional dependencies (yfinance for downloads, TensorFlow for the LSTM),
# uncomment the next line in your own environment:
# !pip install yfinance tensorflow

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import Optional, List, Dict

from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
# HistGradientBoostingRegressor has been stable since scikit-learn 1.0,
# so the old sklearn.experimental enabling import is no longer needed.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import matplotlib.pyplot as plt

# For saving models
import joblib

# Optional deep learning
try:
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    TF_AVAILABLE = True
except Exception as e:
    TF_AVAILABLE = False
    print("TensorFlow not available; LSTM cells can be skipped.")

## 1) Load Data (CSV or yfinance)

In [None]:
# === Option A: CSV (recommended for reproducibility) ===
# Provide a path to a CSV with at least: Date, Open, High, Low, Close, Volume
# The notebook expects 'Date' to be parseable and unique.
CSV_PATH = None  # e.g., 'your_prices.csv' or keep None to use yfinance

# === Option B: yfinance (requires internet) ===
USE_YFINANCE = True
TICKER = 'SPY'
START = '2015-01-01'
END = None  # None = up to latest

def load_data(csv_path: Optional[str], use_yf: bool, ticker: str, start: str, end: Optional[str]) -> pd.DataFrame:
    if csv_path:
        df = pd.read_csv(csv_path)
        if 'Date' not in df.columns:
            raise ValueError("CSV must contain a 'Date' column.")
        df['Date'] = pd.to_datetime(df['Date'])
        df = df.sort_values('Date')
        # Normalize common OHLCV column names to title case (e.g. 'close' -> 'Close')
        rename_map = {}
        for c in df.columns:
            cl = c.lower()
            if cl in ['open','high','low','close','adj close','volume']:
                rename_map[c] = c.title() if cl != 'adj close' else 'Adj Close'
        df = df.rename(columns=rename_map)
        # If 'Adj Close' missing, create a copy from 'Close'
        if 'Adj Close' not in df.columns and 'Close' in df.columns:
            df['Adj Close'] = df['Close']
        return df
    else:
        if not use_yf:
            raise ValueError("Either provide CSV_PATH or enable USE_YFINANCE.")
        import yfinance as yf
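        data = yf.download(ticker, start=start, end=end, auto_adjust=False)
        # Recent yfinance versions may return MultiIndex columns (field, ticker)
        # even for a single ticker; keep only the field level in that case.
        if isinstance(data.columns, pd.MultiIndex):
            data.columns = data.columns.get_level_values(0)
        data = data.reset_index()
        return data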

prices = load_data(CSV_PATH, USE_YFINANCE, TICKER, START, END)
assert {'Date','Open','High','Low','Close','Volume'}.issubset(set(prices.columns)), "Data must have OHLCV columns."
prices = prices.dropna().reset_index(drop=True)
prices.tail()

## 2) Feature Engineering
We build:
- Log returns and forward returns (target)
- Rolling stats (volatility, mean)
- Technical indicators: SMA/EMA, RSI, MACD, Bollinger Bands, Stochastic, ATR, OBV
- Lagged features
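
The RSI in this notebook averages gains and losses with a simple rolling mean. A common variant uses Wilder‑style exponential smoothing instead. The sketch below is one way to write that variant; the function name `rsi_wilder` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style exponential smoothing (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)        # positive moves, zero otherwise
    loss = (-delta).clip(lower=0.0)     # magnitude of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / (avg_loss + 1e-9)   # small constant avoids division by zero
    return 100.0 - 100.0 / (1.0 + rs)

# Hypothetical, steadily rising prices: RSI should sit near its upper bound
prices_up = pd.Series(np.arange(1.0, 41.0))
rsi_up = rsi_wilder(prices_up)
```

Because the result keeps the index of the input series, it can be assigned straight to a DataFrame column without realignment.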

In [None]:
def add_technical_features(df: pd.DataFrame) -> pd.DataFrame:
    d = df.copy()
    d['log_close'] = np.log(d['Adj Close'] if 'Adj Close' in d.columns else d['Close'])
    d['ret_1'] = d['log_close'].diff()
    d['ret_5'] = d['log_close'].diff(5)
    d['ret_10'] = d['log_close'].diff(10)
    d['vol_5'] = d['ret_1'].rolling(5).std()
    d['vol_10'] = d['ret_1'].rolling(10).std()

    # SMA/EMA
    for w in [5,10,20,50]:
        d[f'sma_{w}'] = d['Close'].rolling(w).mean()
        d[f'ema_{w}'] = d['Close'].ewm(span=w, adjust=False).mean()

    # Bollinger (20,2)
    d['bb_mid'] = d['Close'].rolling(20).mean()
    d['bb_std'] = d['Close'].rolling(20).std()
    d['bb_up'] = d['bb_mid'] + 2*d['bb_std']
    d['bb_dn'] = d['bb_mid'] - 2*d['bb_std']
    d['bb_width'] = (d['bb_up'] - d['bb_dn'])/d['bb_mid']

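    # RSI (14), using simple rolling means of gains and losses.
    # clip() keeps the DataFrame index, so the result aligns with d even if
    # the index is not a plain 0..n-1 range.
    delta = d['Close'].diff()
    gain = delta.clip(lower=0.0)
    loss = (-delta).clip(lower=0.0)
    roll_up = gain.rolling(14).mean()
    roll_down = loss.rolling(14).mean()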
    rs = roll_up / (roll_down + 1e-9)
    d['rsi_14'] = 100.0 - (100.0 / (1.0 + rs))

    # MACD (12,26,9)
    ema12 = d['Close'].ewm(span=12, adjust=False).mean()
    ema26 = d['Close'].ewm(span=26, adjust=False).mean()
    d['macd'] = ema12 - ema26
    d['macd_sig'] = d['macd'].ewm(span=9, adjust=False).mean()
    d['macd_hist'] = d['macd'] - d['macd_sig']

    # Stochastic (14)
    low14 = d['Low'].rolling(14).min()
    high14 = d['High'].rolling(14).max()
    d['stoch_k'] = 100 * (d['Close'] - low14) / (high14 - low14 + 1e-9)
    d['stoch_d'] = d['stoch_k'].rolling(3).mean()

    # ATR (14)
    tr1 = d['High'] - d['Low']
    tr2 = (d['High'] - d['Close'].shift()).abs()
    tr3 = (d['Low'] - d['Close'].shift()).abs()
    tr = pd.concat([tr1, tr2, tr3], axis=1).max(axis=1)
    d['atr_14'] = tr.rolling(14).mean()

    # OBV
    direction = np.sign(d['Close'].diff().fillna(0.0))
    d['obv'] = (direction * d['Volume']).cumsum()

    # Lags of key signals
    for col in ['ret_1','vol_5','vol_10','rsi_14','macd','macd_sig','macd_hist','stoch_k','stoch_d','bb_width','atr_14','obv']:
        for lag in [1,2,3,5]:
            d[f'{col}_lag{lag}'] = d[col].shift(lag)

    # Target: next-day log return
    d['y_ret1_ahead'] = d['ret_1'].shift(-1)

    d = d.dropna().reset_index(drop=True)
    return d

feat = add_technical_features(prices)
feat.tail()

## 3) Train / Validation Split (Walk‑Forward CV)

In [None]:
# Use the last ~20% as final test; rest for CV
split_idx = int(len(feat)*0.8)
train_df = feat.iloc[:split_idx].copy()
test_df  = feat.iloc[split_idx:].copy()

features = [c for c in feat.columns if c not in ['Date','Open','High','Low','Close','Adj Close','Volume','log_close','y_ret1_ahead']]
target = 'y_ret1_ahead'

X_train, y_train = train_df[features], train_df[target]
X_test,  y_test  = test_df[features],  test_df[target]

tscv = TimeSeriesSplit(n_splits=5)
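
To see what the walk‑forward folds look like, the sketch below prints the index ranges that `TimeSeriesSplit` produces on a small hypothetical array. It also sets `gap=1`, which is an assumption and not something the notebook does: because the target is a next‑day return, leaving one row between training and validation is one optional way to keep the last training target from touching the first validation day.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(24).reshape(-1, 1)            # 24 hypothetical time-ordered rows
tscv_demo = TimeSeriesSplit(n_splits=5, gap=1)   # gap=1 skips one row before each validation block

folds = []
for fold, (tr_idx, va_idx) in enumerate(tscv_demo.split(X_demo)):
    folds.append((tr_idx, va_idx))
    print(f"fold {fold}: train 0..{tr_idx[-1]}, validate {va_idx[0]}..{va_idx[-1]}")
```

Each training window starts at row 0 and grows, while each validation block lies strictly after it in time.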

## 4) Baselines

In [None]:
def evaluate(y_true, y_pred, label: str):
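    # Take the square root explicitly; the `squared` argument is deprecated in newer scikit-learn
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    # MAPE is unstable when y_true is near zero, which is typical for daily returns; read it with care
    mape = (np.abs((y_true - y_pred) / (y_true + 1e-9))).mean()*100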
    direction_true = (y_true > 0).astype(int)
    direction_pred = (y_pred > 0).astype(int)
    dir_acc = (direction_true == direction_pred).mean()
    return {'model': label, 'RMSE': rmse, 'MAE': mae, 'MAPE%': mape, 'DirAcc': dir_acc}

# Naive baseline: use today's return (known at the close) as the forecast for tomorrow's.
# ret_1 at row t covers t-1 -> t, while the target covers t -> t+1, so nothing leaks.
y_pred_naive = test_df['ret_1']
baseline_naive = evaluate(y_test.values, y_pred_naive.values, 'Naive(ret_t)')

# SMA baseline: trailing 5-day mean of returns, computed on the full frame so the
# first test rows can look back into the training period (past data only).
sma5 = feat['ret_1'].rolling(5).mean().loc[test_df.index].fillna(0.0)
baseline_sma5 = evaluate(y_test.values, sma5.values, 'SMA(5)_ret')

pd.DataFrame([baseline_naive, baseline_sma5])
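
When reading the metrics table, keep in mind that percentage errors behave poorly for returns. The sketch below uses three made‑up return values to show how a single near‑zero true return dominates the percentage error, while MAE and directional accuracy stay interpretable.

```python
import numpy as np

# Hypothetical daily log returns and forecasts, for illustration only
y_true_demo = np.array([0.010, -0.0001, 0.004])
y_pred_demo = np.array([0.008,  0.0010, 0.003])

# Per-observation absolute percentage error
abs_pct_err = np.abs((y_true_demo - y_pred_demo) / y_true_demo) * 100

mae_demo = np.abs(y_true_demo - y_pred_demo).mean()
dir_acc_demo = (np.sign(y_true_demo) == np.sign(y_pred_demo)).mean()

print(abs_pct_err.round(1), mae_demo, dir_acc_demo)
```

The middle observation has a tiny true return, so its percentage error is far larger than the other two even though its absolute error is similar.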

## 5) Strong ML Models

In [None]:
# HistGradientBoosting Regressor
hgb = HistGradientBoostingRegressor(random_state=42)
param_dist_hgb = {
    'max_depth': [3, 5, 7, None],
    'learning_rate': [0.02, 0.05, 0.1],
    'max_leaf_nodes': [15, 31, 63, 127],
    'min_samples_leaf': [10, 20, 50, 100]
}

search_hgb = RandomizedSearchCV(
    estimator=hgb,
    param_distributions=param_dist_hgb,
    n_iter=20,
    scoring='neg_root_mean_squared_error',
    cv=tscv,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

search_hgb.fit(X_train, y_train)
best_hgb = search_hgb.best_estimator_
y_pred_hgb = best_hgb.predict(X_test)

# RandomForest
rf = RandomForestRegressor(random_state=42, n_estimators=500, max_depth=None, min_samples_leaf=50, n_jobs=-1)
param_dist_rf = {
    'n_estimators': [300, 500, 800],
    'max_depth': [None, 8, 12],
    'min_samples_leaf': [20, 50, 100]
}

search_rf = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist_rf,
    n_iter=15,
    scoring='neg_root_mean_squared_error',
    cv=tscv,
    random_state=42,
    n_jobs=-1,
    verbose=0
)

search_rf.fit(X_train, y_train)
best_rf = search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)

results_ml = pd.DataFrame([
    evaluate(y_test.values, y_pred_hgb, 'HistGB'),
    evaluate(y_test.values, y_pred_rf,  'RandomForest'),
])
results_ml

## 6) Optional: LSTM (Deep Learning)
If TensorFlow is available, we build a compact LSTM over a sliding window of features.
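
The sliding‑window input can be pictured with a toy array. The sketch below uses NumPy's `sliding_window_view` on hypothetical shapes (6 rows, 2 features, window of 3) to show the `(samples, timesteps, features)` layout that an LSTM layer expects; it is an illustration of the shape only, not the notebook's own sequence builder.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_rows, n_feat, window = 6, 2, 3   # hypothetical toy sizes
X_demo = np.arange(n_rows * n_feat, dtype=np.float32).reshape(n_rows, n_feat)

# sliding_window_view puts the window axis last: (n_rows - window + 1, n_feat, window)
windows = sliding_window_view(X_demo, window_shape=window, axis=0)

# Swap the last two axes to get (samples, timesteps, features)
X_seq = windows.transpose(0, 2, 1)
print(X_seq.shape)
```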

In [None]:
WINDOW = 20  # days lookback
USE_DL = TF_AVAILABLE

def make_sequences(X: pd.DataFrame, y: pd.Series, window: int):
    """Build (samples, window, features) arrays.

    Each window ends at row i (inclusive) and is paired with the target at row i,
    so the most recent features known at the close are part of the input.
    """
    Xv, yv = [], []
    arr = X.values.astype(np.float32)
    tgt = y.values.astype(np.float32)
    for i in range(window - 1, len(X)):
        Xv.append(arr[i - window + 1:i + 1, :])
        yv.append(tgt[i])
    return np.array(Xv), np.array(yv)

dl_train_df = train_df.copy()
dl_test_df  = test_df.copy()

if USE_DL:
    scaler = StandardScaler()
    Xtr_s = scaler.fit_transform(dl_train_df[features].values)
    Xte_s = scaler.transform(dl_test_df[features].values)

    Xtr_s = pd.DataFrame(Xtr_s, index=dl_train_df.index, columns=features)
    Xte_s = pd.DataFrame(Xte_s, index=dl_test_df.index, columns=features)

    Xtr_seq, ytr_seq = make_sequences(Xtr_s, dl_train_df[target], WINDOW)
    # Prepend the last WINDOW-1 training rows so every test row gets a full window
    # and the LSTM yields exactly one prediction per test row.
    Xte_seq, yte_seq = make_sequences(
        pd.concat([Xtr_s.tail(WINDOW - 1), Xte_s]),
        pd.concat([dl_train_df[target].tail(WINDOW - 1), dl_test_df[target]]),
        WINDOW
    )

    model = keras.Sequential([
        layers.Input(shape=(WINDOW, len(features))),
        layers.LSTM(64, return_sequences=False),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='mse')
    es = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True, monitor='val_loss')

    hist = model.fit(
        Xtr_seq, ytr_seq,
        validation_split=0.2,
        epochs=50,
        batch_size=64,
        callbacks=[es],
        verbose=0
    )
    y_pred_lstm = model.predict(Xte_seq, verbose=0).ravel()
else:
    y_pred_lstm = None

print("LSTM ready" if y_pred_lstm is not None else "Skipped LSTM (TensorFlow not installed)")

## 7) Ensembling
Blend the two tuned ML models (and the LSTM, if available).

In [None]:
preds = [y_pred_hgb, y_pred_rf]
labels = ['HistGB','RandomForest']

if isinstance(y_pred_lstm, np.ndarray):
    preds.append(y_pred_lstm)
    labels.append('LSTM')

# Equal weights by default; you can optimize weights on a validation fold
pred_ens = np.mean(np.vstack(preds), axis=0)
ens_metrics = evaluate(y_test.values[-len(pred_ens):], pred_ens, 'Ensemble')
pd.DataFrame([ens_metrics])
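
The equal‑weight blend above can be replaced by weights learned on a validation fold. One simple scheme, sketched below with made‑up numbers, weights each model by the inverse of its validation RMSE. The helper `inverse_rmse_weights` is hypothetical, and the weights should be fitted on validation data only, never on the test set.

```python
import numpy as np

def inverse_rmse_weights(y_val: np.ndarray, val_preds: list) -> np.ndarray:
    """Weight each model by 1 / RMSE on a validation fold, normalized to sum to 1."""
    rmse = np.array([np.sqrt(np.mean((y_val - p) ** 2)) for p in val_preds])
    inv = 1.0 / (rmse + 1e-12)   # small constant guards against a zero RMSE
    return inv / inv.sum()

# Hypothetical validation targets and two models' validation predictions
y_val_demo = np.array([0.01, -0.02, 0.005, 0.0])
p_a = y_val_demo + 0.001   # model A: small constant error
p_b = y_val_demo + 0.004   # model B: larger constant error

w = inverse_rmse_weights(y_val_demo, [p_a, p_b])
blend_demo = w[0] * p_a + w[1] * p_b
print(w)
```

The model with the smaller validation error receives the larger weight; the same weights would then be applied to the test predictions.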

## 8) Backtest (Toy)

In [None]:
# Simple strategy: go long if predicted next-day return > threshold, otherwise stay flat.
thr = 0.0005  # 5 bps
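pred_series = pd.Series(pred_ens, index=y_test.index[-len(pred_ens):], name='pred_ret')
true_series = y_test.iloc[-len(pred_ens):]

signal = (pred_series > thr).astype(int)
# pred_series[t] forecasts y_ret1_ahead[t] (the return from close t to close t+1),
# so the signal and the realized return are already aligned; no extra shift is needed.
strategy_ret = signal * true_series
# The returns are log returns, so compound them with exp of the cumulative sum.
cum_pnl = np.exp(strategy_ret.cumsum())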

print('Directional Accuracy:', ((true_series > 0) == (pred_series > 0)).mean())
print('Annualized Return (approx):', (cum_pnl.iloc[-1] ** (252/len(cum_pnl)) - 1) if len(cum_pnl) > 0 else np.nan)

plt.figure(figsize=(10,4))
cum_pnl.plot()
plt.title('Cumulative Strategy PnL (toy)')
plt.xlabel('Time')
plt.ylabel('Equity Curve')
plt.show()

## 9) Save Artifacts

In [None]:
# Save the tuned ML models (the LSTM and its scaler are not saved here)
joblib.dump(best_hgb, 'model_histgb.joblib')
joblib.dump(best_rf, 'model_rf.joblib')

# Save last predictions
out = pd.DataFrame({
    'date': test_df['Date'].iloc[-len(pred_ens):].values if 'Date' in test_df.columns else np.arange(len(pred_ens)),
    'y_true': y_test.iloc[-len(pred_ens):].values,
    'y_pred_ensemble': pred_ens
})
out.to_csv('predictions.csv', index=False)
print('Saved: model_histgb.joblib, model_rf.joblib, predictions.csv')
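
Saved models can be loaded back with `joblib.load`. The sketch below round‑trips a tiny stand‑in model trained on random data (a hypothetical substitute for `best_rf`) through a temporary file and checks that the reloaded copy gives the same predictions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model on random data (hypothetical; stands in for best_rf)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.normal(size=50)
model_demo = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.joblib")
    joblib.dump(model_demo, path)
    reloaded = joblib.load(path)

same_predictions = np.allclose(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

When reloading for real use, the feature columns must be passed in the same order as at training time, and the scikit‑learn version should match the one used to save the model.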

## 10) How to Use on Your Data
1. Put your OHLCV data in a CSV with columns: `Date, Open, High, Low, Close, Volume` (optional `Adj Close`).
2. Set `CSV_PATH = 'your_file.csv'` and `USE_YFINANCE = False`.
3. Run all cells. The notebook will feature‑engineer, CV‑tune, evaluate, ensemble, and backtest.
4. Inspect `predictions.csv` and the printed metrics.

**Notes for best accuracy**:
- Stick to **returns** as the target (more stationary) and predict shorter horizons (1–5 days).
- Keep **walk‑forward** validation to avoid look‑ahead bias.
- Regularize heavily (larger `min_samples_leaf`, smaller learning rate) to reduce overfitting.
- Try adding **calendar** features (weekday, month), macro factors, and **market regime** indicators (VIX, rates) if available.
- If using LSTM, standardize inputs and experiment with input windows (e.g., 20–60 days).
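
The calendar features suggested above could be added along these lines. This is a minimal sketch: the helper `add_calendar_features` and the column names it creates are hypothetical, and `is_month_end` marks the calendar month end, which is not always the last trading day.

```python
import pandas as pd

def add_calendar_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Append weekday, month, and a calendar month-end flag derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["weekday"] = dates.dt.dayofweek                      # Monday=0 ... Sunday=6
    out["month"] = dates.dt.month
    out["is_month_end"] = dates.dt.is_month_end.astype(int)  # calendar month end, not last trading day
    return out

# Hypothetical two-row frame: a Wednesday month end followed by a Thursday
demo = pd.DataFrame({"Date": ["2024-01-31", "2024-02-01"]})
demo_cal = add_calendar_features(demo)
print(demo_cal)
```

These columns depend only on the date itself, so they are known in advance and do not introduce look‑ahead bias.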