Walk-Forward Validation / Rolling Forecast Origin

General Hierarchy of Hyperparameter Importance (often, but context matters):

1. eta (learning_rate) & n_estimators: These are highly impactful and strongly linked. Lower eta generally requires higher n_estimators for convergence. Finding a good balance is crucial for performance.
2. max_depth: Controls the complexity of individual trees. Very important for preventing overfitting/underfitting.
3. subsample & colsample_bytree (or _bylevel, _bynode): Introduce stochasticity, reducing variance and preventing overfitting. Often provide good gains.
4. min_child_weight: Acts as a regularizer, particularly useful on imbalanced datasets to prevent trees from focusing too much on single outlier instances of the minority class.
5. gamma (min_split_loss): Another regularization parameter, controlling whether a split is made based on loss reduction.
6. Regularization (lambda, alpha): L2 and L1 regularization help prevent overfitting by shrinking weights. They are important, but often fine-tuning them yields smaller gains compared to getting the core structure (max_depth, eta/n_estimators) right, unless you have severe overfitting issues.

Here are typical considerations for fixed values:

1. gamma (min_split_loss):
Purpose: Controls the minimum loss reduction needed to make a split. Higher values lead to fewer splits (more conservative).
Default: 0 (allows any split that reduces loss, even slightly).
Typical Fixed Value: 0. Sticking with the default is very common if you are not tuning it. It allows the tree to grow based purely on loss improvement.
Slightly More Conservative Fixed Value: Sometimes people might use a very small value like 0.1 or 0.2 if they want to introduce a tiny barrier against splitting on minimal gains, but usually, it's kept at 0 unless tuned.
Range Idea: Think 0 to ~1 as the area of most impact; values much larger quickly prune the tree heavily.

2. lambda (reg_lambda / L2 Regularization):
Purpose: Penalizes large weights using the square of their magnitude. Encourages smaller, more distributed weights, good for preventing overfitting when features might be correlated.
Default: 1. This already provides a moderate amount of L2 regularization.
Typical Fixed Value: 1. The default is a very sensible starting point and often used as the fixed value.
Alternative Moderate Fixed Values: Values between 0.5 and 2 are generally considered moderate L2 regularization. You could fix it at 1.5 (like you had) or 1 without drastically changing the regularization strength compared to the default.
Range Idea: 0 means no L2. Values from 0.1 to 10 cover a wide range from weak to strong regularization.

3. alpha (reg_alpha / L1 Regularization):
Purpose: Penalizes large weights using their absolute magnitude. Can drive feature weights to exactly zero, performing implicit feature selection.
Default: 0 (no L1 regularization).
Typical Fixed Value: 0. If you don't have a specific reason to encourage sparsity or perform L1 feature selection, leaving it at the default is standard practice.
Slightly More Conservative Fixed Value: If you suspect some L1 might help a little without tuning, a very small value like 0.01 or 0.1 might be chosen. But often, if L1 is desired, it's added to the tuning grid.
Range Idea: Similar to lambda, 0 means no L1, and values from 0.1 to 10 cover weak to strong L1 effects.

Summary for Fixed Values (when not tuning):

Most Common / Standard Defaults:

gamma: 0
lambda: 1
alpha: 0

Slightly Modified Defaults (if desired):

gamma: 0 or 0.1
lambda: 1 or 1.5
alpha: 0 or 0.1

In [4]:
import os
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss
from sklearn.model_selection import ParameterGrid
from sklearn.exceptions import UndefinedMetricWarning
import time
import warnings # Make sure this is imported

# --- Suppress Warnings ---
# Suppress UndefinedMetricWarning (when precision/recall/F1 is 0)
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
# Suppress potential future warnings from xgboost itself
warnings.filterwarnings("ignore", category=FutureWarning, module='xgboost')
# Suppress the specific 'use_label_encoder' UserWarning from xgboost's sklearn wrapper
warnings.filterwarnings("ignore", category=UserWarning, module='xgboost.sklearn')

# --- Constants and Configuration ---
DB_CONNECTION = os.getenv('DB_CONNECTION', 'postgresql+psycopg2://postgres:bubZ$tep433@localhost:5432/ohlcv_db')
TARGET_THRESHOLD_PCT = 0.1 # <<< Target threshold percentage variable
SQL_LIMIT = None # Set to an integer (e.g., 5000) for testing, None for full run

# Walk-forward params
TRAIN_WINDOW_HOURS = 3
TEST_WINDOW_HOURS = 1
STEP_MINUTES = 2

# Convert time durations to rows (assuming 1 row per minute)
TRAIN_WINDOW_ROWS = TRAIN_WINDOW_HOURS * 60
TEST_WINDOW_ROWS = TEST_WINDOW_HOURS * 60
STEP_ROWS = STEP_MINUTES

# --- Simple Grid Search Configuration ---
# Define a *small* grid of parameters to search within each window
# Keep this very limited to control compute time!
INNER_CV_PARAM_GRID = {
    'max_depth': [3, 5],          # Test a shallow and slightly deeper tree
    'n_estimators': [150, 350],   # Test fewer/more trees
    'eta': [0.04, 0.12],          # Optionally test learning rate, but adds more combos
    'subsample': [0.8, 1.0],
    'alpha': [0.3, 0.7],
    'min_child_weight': [2, 5]
}
# Size of the validation set taken from the *end* of the training window (%)
INNER_CV_VALIDATION_PCT = 0.20 # Use last 20% of training data for quick validation

# Fixed XGBoost parameters (not tuned in the inner loop)
XGB_FIXED_PARAMS = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',    # Often used for XGBoost internal evaluation
    # Fixed learning rate (unless added to grid)
    #'subsample': 0.8,
    'colsample_bytree': 0.8,
    #'min_child_weight': 1,
    'gamma': 0.05,
    'lambda': 3,               # L2 regularization
    #'alpha': 0.8,                # L1 regularization
    'random_state': 42,
    'n_jobs': -1,
    # Optional GPU Acceleration:
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor'
}


# --- 1. Data Loading and Initial Setup ---
print("--- 1. Data Loading ---")
try:
    engine = create_engine(DB_CONNECTION)
except Exception as e:
    print(f"Error creating database engine: {e}")
    exit()

# Define the feature columns to fetch
feature_columns_db = [ # Same list as before...
    'timestamp', 'symbol', 'open', 'high', 'low', 'close', 'volume_btc', 'volume_usd',
    'garman_klass_12h', 'price_range_pct', 'oc_change_pct', 'parkinson_3h',
    'ma_3h', 'rolling_std_3h', 'lag_3h_price_return', 'lag_6h_price_return',
    'lag_12h_price_return', 'lag_24h_price_return', 'lag_48h_price_return',
    'lag_72h_price_return', 'lag_168h_price_return', 'volume_return_1h',
    'lag_3h_volume_return', 'lag_6h_volume_return', 'lag_12h_volume_return',
    'lag_24h_volume_return', 'ma_6h', 'ma_12h', 'ma_24h', 'ma_48h', 'ma_72h',
    'ma_168h', 'rolling_std_6h', 'rolling_std_12h', 'rolling_std_24h',
    'rolling_std_48h', 'rolling_std_72h', 'rolling_std_168h', 'atr_14h',
    'atr_24h', 'atr_48h', 'close_div_ma_24h', 'close_div_ma_48h',
    'close_div_ma_168h', 'ma12_div_ma48', 'ma24_div_ma168', 'std12_div_std72',
    'volume_btc_x_range', 'rolling_std_3h_sq', 'price_return_1h_sq',
    'rolling_std_12h_sqrt',
    'hour_0', 'hour_1', 'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6',
    'hour_7', 'hour_8', 'hour_9', 'hour_10', 'hour_11', 'hour_12',
    'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18',
    'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23',
    'day_0', 'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
    'rsi_14h', 'macd', 'macd_signal', 'macd_hist', 'volume_ma_12h',
    'volume_ma_24h', 'volume_ma_72h', 'volume_ma_168h', 'volume_div_ma_24h',
    'obv', 'obv_ma_24h', 'atr_14_div_atr_48', 'stoch_k', 'stoch_d',
    'z_score_24h', 'rolling_skew_24h', 'rolling_kurt_24h',
    'adx_14h', 'plus_di_14h', 'minus_di_14h', 'ad_line', 'cmf_20h', 'cci_20h',
    'bband_width_20h', 'bband_pctb_20h', 'close_pos_in_range',
    'vwap_24h', 'log_return_1h', 'stoch_rsi_k', 'stoch_rsi_d'
]
columns_str = ', '.join(f'"{col}"' for col in feature_columns_db)
query = f"SELECT {columns_str} FROM public.advanced_features ORDER BY timestamp ASC"
if SQL_LIMIT:
    query += f" LIMIT {SQL_LIMIT}"

try:
    print(f"Fetching data (limit={SQL_LIMIT})...")
    fetch_start_time = time.time()
    df_raw = pd.read_sql(query, engine)
    fetch_end_time = time.time()
    print(f"Data loaded. Shape: {df_raw.shape}. Time: {fetch_end_time - fetch_start_time:.2f}s")
    df_raw['timestamp'] = pd.to_datetime(df_raw['timestamp'])
    print("\nRaw Dataset Information:")
    if not df_raw.empty:
        print(f"Time range: {df_raw['timestamp'].min()} to {df_raw['timestamp'].max()}")
        print(f"Symbols: {df_raw['symbol'].unique()}")
    else:
        print("DataFrame is empty after fetching.")
        exit()
except Exception as e:
    print(f"Error loading data: {e}")
    exit()

# --- 2. Data Cleaning ---
print("\n--- 2. Data Cleaning ---")
numeric_columns = df_raw.select_dtypes(include=np.number).columns
missing_before = df_raw[numeric_columns].isna().sum().sum()
infinite_before = np.isinf(df_raw[numeric_columns].values).sum()
print(f"Missing/Infinite before: {missing_before} / {infinite_before}")
df = df_raw.replace([np.inf, -np.inf], np.nan).dropna(subset=numeric_columns)
missing_after = df[numeric_columns].isna().sum().sum()
infinite_after = np.isinf(df[numeric_columns].values).sum()
print(f"Missing/Infinite after: {missing_after} / {infinite_after}")
print(f"Rows removed: {len(df_raw) - len(df)} ({((len(df_raw) - len(df)) / len(df_raw) * 100):.2f}%)")
if df.empty: print("DataFrame empty after cleaning."); exit()

# --- 3. Feature Engineering ---
print("\n--- 3. Feature Engineering ---")
# Dummy encode symbols
print("Creating dummy variables for symbols (BTC, SOL)...")
df['BTC'] = (df['symbol'] == 'BTC').astype(int)
df['SOL'] = (df['symbol'] == 'SOL').astype(int)

# Create binary target
print(f"Creating binary target 'target' (1-hour return >= {TARGET_THRESHOLD_PCT}%)...")
# Ensure shift happens correctly within each group
df = df.sort_values(['symbol', 'timestamp']) # Sort for reliable shift within groups
df['future_price'] = df.groupby('symbol')['close'].shift(-TEST_WINDOW_ROWS) # Shift by test window size (60 rows)
df['price_return_1h'] = (df['future_price'] - df['close']) / df['close'] * 100
df['target'] = (df['price_return_1h'] >= TARGET_THRESHOLD_PCT).astype(int)
df = df.drop(['future_price', 'price_return_1h'], axis=1)
print(f"NaN values in target before drop: {df['target'].isna().sum()}")
df = df.dropna(subset=['target'])
print(f"Rows after removing NaN targets: {len(df)}")
if df.empty: print("DataFrame empty after target creation."); exit()

# Display target distribution
target_counts = df['target'].value_counts()
print("\nTarget variable distribution:")
print(f"  0 (< {TARGET_THRESHOLD_PCT}% return): {target_counts.get(0, 0)} ({target_counts.get(0, 0) / len(df) * 100:.2f}%)")
print(f"  1 (>= {TARGET_THRESHOLD_PCT}% return): {target_counts.get(1, 0)} ({target_counts.get(1, 0) / len(df) * 100:.2f}%)")

# --- 4. Final Preparation for Walk-Forward ---
print("\n--- 4. Final Preparation ---")
df = df.sort_values('timestamp') # Sort chronologically overall
df = df.reset_index(drop=True) # Reset index for iloc
print(f"Final DataFrame shape for backtesting: {df.shape}")

TARGET_COLUMN = 'target'
EXCLUDE_COLS = [TARGET_COLUMN, 'symbol', 'timestamp']
FEATURE_COLUMNS = [col for col in df.columns if col not in EXCLUDE_COLS]
print(f"\n--- Feature Selection ---")
print(f"Target Column: '{TARGET_COLUMN}'")
print(f"Number of features selected: {len(FEATURE_COLUMNS)}")
print(f"  - 'BTC' included: {'BTC' in FEATURE_COLUMNS}")
print(f"  - 'SOL' included: {'SOL' in FEATURE_COLUMNS}")

# --- 5. Walk-Forward Validation Implementation ---
print("\n--- 5. Starting Walk-Forward Validation ---")
all_metrics = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
iteration_count = 0
n_rows_total = len(df)
current_train_start_idx = 0
total_iterations_estimate = max(0, (n_rows_total - TRAIN_WINDOW_ROWS - TEST_WINDOW_ROWS) // STEP_ROWS)
print(f"Total rows: {n_rows_total}, Train: {TRAIN_WINDOW_ROWS}, Test: {TEST_WINDOW_ROWS}, Step: {STEP_ROWS}")
print(f"Estimated iterations: {total_iterations_estimate}")
print(f"Inner CV Grid: {INNER_CV_PARAM_GRID}")
print("-" * 30)
start_loop_time = time.time()

while True:
    # --- Define Window Boundaries ---
    train_end_idx = current_train_start_idx + TRAIN_WINDOW_ROWS
    test_start_idx = train_end_idx
    test_end_idx = test_start_idx + TEST_WINDOW_ROWS

    # --- Boundary Checks ---
    if test_end_idx > n_rows_total:
        print(f"\nStopping: Test window end index ({test_end_idx}) > total rows ({n_rows_total}).")
        break

    # --- Data Slicing ---
    train_df = df.iloc[current_train_start_idx : train_end_idx]
    test_df = df.iloc[test_start_idx : test_end_idx]

    # --- Data Validity Checks ---
    min_train_samples = 50
    min_test_samples = 5
    if len(train_df) < min_train_samples or len(test_df) < min_test_samples:
         print(f"Skipping iter {iteration_count + 1}: Insufficient data train ({len(train_df)}) or test ({len(test_df)}).")
         current_train_start_idx += STEP_ROWS
         continue

    X_train_full = train_df[FEATURE_COLUMNS]
    y_train_full = train_df[TARGET_COLUMN]
    X_test = test_df[FEATURE_COLUMNS]
    y_test = test_df[TARGET_COLUMN]

    # --- Check for Single Class in Full Train/Test ---
    train_counts = y_train_full.value_counts()
    test_counts = y_test.value_counts()
    if len(train_counts) < 2:
        print(f"Skipping iter {iteration_count + 1}: Full training data has only one class ({train_counts.index.tolist()}).")
        current_train_start_idx += STEP_ROWS
        continue
    if len(test_counts) < 2:
         print(f"Warning iter {iteration_count + 1}: Test data has only one class ({test_counts.index.tolist()}). Metrics might be affected.")

    # --- Calculate scale_pos_weight (based on full training data) ---
    neg_count = train_counts.get(0, 0)
    pos_count = train_counts.get(1, 0)
    scale_pos_weight_val = neg_count / pos_count if pos_count > 0 else 1.0

    # --- Inner Loop: Simple Grid Search ---
    iter_start_time = time.time()
    best_val_score = -np.inf # Initialize for maximization (e.g., F1) or np.inf for minimization (e.g., logloss)
    best_params_iter = None

    # Split training data for inner validation
    val_size = int(len(X_train_full) * INNER_CV_VALIDATION_PCT)
    if val_size < min_test_samples: # Ensure validation set is usable
        print(f"Warning iter {iteration_count + 1}: Validation set too small ({val_size}), skipping inner CV.")
        best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Use first combo as default
    else:
        X_train_sub = X_train_full[:-val_size]
        y_train_sub = y_train_full[:-val_size]
        X_val = X_train_full[-val_size:]
        y_val = y_train_full[-val_size:]

        # Check if validation sub-set has both classes
        if len(y_val.unique()) < 2:
            print(f"Warning iter {iteration_count + 1}: Validation set has only one class, skipping inner CV.")
            best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Use first combo
        else:
            # Iterate through the small grid
            for params_cv in ParameterGrid(INNER_CV_PARAM_GRID):
                try:
                    # Combine fixed and CV params
                    current_cv_params = {**XGB_FIXED_PARAMS, **params_cv}

                    model_cv = XGBClassifier(**current_cv_params,
                                             scale_pos_weight=scale_pos_weight_val, # Use weight calculated on full train set
                                             use_label_encoder=False)

                    model_cv.fit(X_train_sub, y_train_sub,
                                 eval_set=[(X_val, y_val)], # Optional: use early stopping
                                 early_stopping_rounds=10, # Stop if val score doesn't improve for 10 rounds
                                 verbose=False) # Suppress CV training verbosity

                    y_pred_val = model_cv.predict(X_val)
                    # Use F1 score on validation set - good for imbalance
                    val_score = f1_score(y_val, y_pred_val, average='binary', pos_label=1, zero_division=0)

                    if val_score > best_val_score:
                        best_val_score = val_score
                        best_params_iter = params_cv

                except Exception as e_cv:
                     print(f"Error during inner CV iter {iteration_count + 1} with params {params_cv}: {e_cv}")
                     # If error occurs, maybe default to first param set or skip CV for this iteration
                     if best_params_iter is None:
                         best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0]


    # Handle case where inner CV failed entirely or was skipped
    if best_params_iter is None:
        print(f"Warning iter {iteration_count + 1}: Inner CV did not find best params, using default.")
        best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Default to first combo

    # --- Train Final Model for this Iteration ---
    # Combine fixed and best CV params found
    final_iter_params = {**XGB_FIXED_PARAMS, **best_params_iter}

    # Display progress less frequently
    if (iteration_count == 0) or ((iteration_count + 1) % 50 == 0) or (test_end_idx > n_rows_total):
         print(f"\nIter {iteration_count + 1}/{total_iterations_estimate}: Train[{current_train_start_idx}:{train_end_idx-1}], Test[{test_start_idx}:{test_end_idx-1}], Best Params: {best_params_iter}")

    current_model = XGBClassifier(**final_iter_params,
                                  scale_pos_weight=scale_pos_weight_val,
                                  use_label_encoder=False)

    # Train on the FULL training set for this window
    current_model.fit(X_train_full, y_train_full, verbose=False)

    # --- Prediction and Evaluation ---
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)
    recall = recall_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)
    f1 = f1_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)

    all_metrics['accuracy'].append(accuracy)
    all_metrics['precision'].append(precision)
    all_metrics['recall'].append(recall)
    all_metrics['f1'].append(f1)

    iteration_count += 1
    iter_end_time = time.time()
    # Optional: print metrics per iteration
    # print(f"  Metrics: Acc={accuracy:.4f}, Prc={precision:.4f}, Rec={recall:.4f}, F1={f1:.4f} (took {iter_end_time - iter_start_time:.2f}s)")

    # --- Move to Next Window ---
    current_train_start_idx += STEP_ROWS

end_loop_time = time.time()
print("-" * 30)
loop_duration_minutes = (end_loop_time - start_loop_time) / 60
print(f"Walk-Forward Validation finished in {end_loop_time - start_loop_time:.2f} seconds ({loop_duration_minutes:.2f} minutes).")

# --- 6. Aggregate and Display Results ---
print("\n--- 6. Final Results ---")
if iteration_count > 0:
    avg_accuracy = np.mean(all_metrics['accuracy'])
    avg_precision = np.mean(all_metrics['precision'])
    avg_recall = np.mean(all_metrics['recall'])
    avg_f1 = np.mean(all_metrics['f1'])

    print("\n--- Average Walk-Forward Validation Results (XGBoost with Inner CV) ---")
    print(f"Total Folds / Iterations Evaluated: {iteration_count}")
    print(f"Target Threshold: {TARGET_THRESHOLD_PCT}%")
    print(f"Average Accuracy:  {avg_accuracy:.4f}")
    print(f"Average Precision: {avg_precision:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")
    print(f"Average Recall:    {avg_recall:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")
    print(f"Average F1-Score:  {avg_f1:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")

    std_accuracy = np.std(all_metrics['accuracy'])
    std_precision = np.std(all_metrics['precision'])
    std_recall = np.std(all_metrics['recall'])
    std_f1 = np.std(all_metrics['f1'])
    print("\n--- Standard Deviation of Metrics Across Folds ---")
    print(f"Std Dev Accuracy:  {std_accuracy:.4f}")
    print(f"Std Dev Precision: {std_precision:.4f}")
    print(f"Std Dev Recall:    {std_recall:.4f}")
    print(f"Std Dev F1-Score:  {std_f1:.4f}")
else:
    print("\nNo iterations were successfully completed.")

print("\nScript finished.")

--- 1. Data Loading ---
Fetching data (limit=None)...
Data loaded. Shape: (2728, 112). Time: 0.22s

Raw Dataset Information:
Time range: 2025-04-04 20:17:00 to 2025-04-05 19:00:00
Symbols: ['BTC' 'SOL']

--- 2. Data Cleaning ---
Missing/Infinite before: 1824 / 0
Missing/Infinite after: 0 / 0
Rows removed: 336 (12.32%)

--- 3. Feature Engineering ---
Creating dummy variables for symbols (BTC, SOL)...
Creating binary target 'target' (1-hour return >= 0.1%)...
NaN values in target before drop: 0
Rows after removing NaN targets: 2392

Target variable distribution:
  0 (< 0.1% return): 1712 (71.57%)
  1 (>= 0.1% return): 680 (28.43%)

--- 4. Final Preparation ---
Final DataFrame shape for backtesting: (2392, 115)

--- Feature Selection ---
Target Column: 'target'
Number of features selected: 112
  - 'BTC' included: True
  - 'SOL' included: True

--- 5. Starting Walk-Forward Validation ---
Total rows: 2392, Train: 180, Test: 60, Step: 2
Estimated iterations: 1076
Inner CV Grid: {'max_depth':

Computational Overhead: This adds a substantial calculation step before every single training iteration. At a 5-minute retraining frequency, calculating weighted importances across 100 past results, selecting features, and subsetting the data adds significant time and complexity to each cycle. This might make keeping up with the 5-minute interval difficult.