## Logistic Regression Model Utilizing Walk-Forward Validation

#### Author: Zachary Wright, CFA, FRM

This is a simple exercise that addresses the data leakage problem in time-series prediction. The samples are kept in time order and split chronologically, so the model is trained only on data that precedes the test sample.
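To make the leakage problem concrete, here is a minimal sketch that contrasts a shuffled split with a chronological one. The synthetic `frame` and its `feature` column are hypothetical stand-ins for a price series, not part of the original notebook. The check asks whether any training row is dated later than the earliest test row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a price series: 100 business days of synthetic data.
dates = pd.bdate_range("2023-01-02", periods=100)
frame = pd.DataFrame({"feature": np.arange(100, dtype=float)}, index=dates)

# Shuffled split: rows are assigned to train/test at random, ignoring time.
shuffled_train, shuffled_test = train_test_split(
    frame, test_size=0.2, shuffle=True, random_state=0
)

# Chronological split: the first 80% of rows train, the last 20% test.
split_point = int(len(frame) * 0.8)
chrono_train, chrono_test = frame.iloc[:split_point], frame.iloc[split_point:]

# Leakage check: does any training row come after the earliest test row?
shuffled_leaks = shuffled_train.index.max() > shuffled_test.index.min()
chrono_leaks = chrono_train.index.max() > chrono_test.index.min()

print(f"Shuffled split trains on future rows:      {shuffled_leaks}")
print(f"Chronological split trains on future rows: {chrono_leaks}")
```

With a shuffled split the model sees rows dated after some of the rows it is later tested on, which is the leakage the chronological split is meant to avoid.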

In [7]:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

#Download historical daily price data for NVIDIA
ticker = "NVDA"
data = yf.download(ticker, start="2020-01-01", end="2024-01-01")

#Feature Engineering with closing price data
data["MA5"] = data["Close"].rolling(window=5).mean()
data["MA10"] = data["Close"].rolling(window=10).mean()
data["Momentum"] = data["Close"] - data["Close"].shift(4)
data["Daily Return"] = data["Close"].pct_change()

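#Binary target variable (1 if the next day's close is higher than today's, 0 otherwise)
#Note: the final row has no next-day close, so the comparison is False and its target is 0
data["Target"] = (data["Close"].shift(-1) > data["Close"]).astype(int)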

#Drop NaN or null values
data.dropna(inplace=True)

#Define Features
features = ["MA5", "MA10", "Momentum", "Daily Return"]
X = data[features]
y = data["Target"]

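#Chronological hold-out split (a single walk-forward step): no shuffling, so training data always precedes test data
split_point = int(len(data) * 0.8)  #First 80% of rows for training, final 20% for testing (in time order)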
X_train, X_test = X.iloc[:split_point], X.iloc[split_point:]
y_train, y_test = y.iloc[:split_point], y.iloc[split_point:]

#Standardize Features based on Training Data Only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#Train Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)

#Predictions on Future Data
y_pred = model.predict(X_test)

#Evaluation
print(f"\nAccuracy on future unseen data: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))

Accuracy on future unseen data: 0.46
              precision    recall  f1-score   support

           0       0.45      1.00      0.62        90
           1       1.00      0.01      0.02       110

    accuracy                           0.46       200
   macro avg       0.73      0.50      0.32       200
weighted avg       0.75      0.46      0.29       200
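The report is easier to read next to a naive baseline. The sketch below rebuilds label arrays from the rounded figures in the report (an assumption: the counts 90, 109 and 1 are inferred from the printed support, recall and precision, not taken from the raw predictions) and compares the model with always predicting the majority class of the test window. The baseline uses test-window labels, so it is a reference point only, not a tradable rule.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Counts inferred from the rounded report (assumption, not the raw predictions):
# 90 true "down" days, all predicted 0; 110 true "up" days, only 1 predicted 1.
y_true = np.array([0] * 90 + [1] * 110)
y_hat = np.array([0] * 90 + [0] * 109 + [1] * 1)

model_accuracy = accuracy_score(y_true, y_hat)

# Baseline: always predict the majority class of the test window.
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = accuracy_score(y_true, np.full_like(y_true, majority_class))

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_hat)
print(matrix)
print(f"Model accuracy:    {model_accuracy:.3f}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Under these inferred counts the model predicts "down" on all but one test day, which is why it scores below the majority-class baseline even though its precision for class 1 is 1.00.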


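The notebook evaluates a single chronological split. Walk-forward validation in the fuller sense repeats that split over successive windows, refitting the model each time. The following is a minimal sketch using scikit-learn's `TimeSeriesSplit` with an expanding training window. The data here is synthetic (`X_demo` and `y_demo` are hypothetical stand-ins for the feature matrix and target), so the scores carry no meaning beyond showing the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 time-ordered rows, 4 features, random binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.integers(0, 2, size=500)

# Each fold trains on all rows before the test window, then tests on the next block.
splitter = TimeSeriesSplit(n_splits=5)
fold_scores = []

for fold, (train_idx, test_idx) in enumerate(splitter.split(X_demo), start=1):
    # Refit scaler and model inside the fold so no test-window statistics leak in.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression())
    pipeline.fit(X_demo[train_idx], y_demo[train_idx])
    score = accuracy_score(y_demo[test_idx], pipeline.predict(X_demo[test_idx]))
    fold_scores.append(score)
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}, accuracy {score:.2f}")

print(f"Mean walk-forward accuracy: {np.mean(fold_scores):.2f}")
```

Averaging over several folds gives a less noisy estimate than one 80/20 split, at the cost of refitting the model once per fold.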


