Walk-Forward Validation / Rolling Forecast Origin
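
As a rough illustration of the rolling-origin idea, the sketch below generates half-open index windows from a train size, test size, and step, using the 175/24/12 values configured later in this notebook. The function name and the optional `gap_rows` argument are assumptions for illustration, not part of the script; a gap equal to the label horizon is one common way to keep training labels from overlapping the test period.

```python
def walk_forward_windows(n_rows, train_rows, test_rows, step_rows, gap_rows=0):
    """Yield (train_start, train_end, test_start, test_end) as half-open index ranges."""
    start = 0
    while True:
        train_end = start + train_rows
        test_start = train_end + gap_rows      # optional buffer between train and test
        test_end = test_start + test_rows
        if test_end > n_rows:                  # stop when the test window runs off the data
            break
        yield start, train_end, test_start, test_end
        start += step_rows                     # roll the forecast origin forward


# 300 hourly rows, ~1 week of training, 24 h test window, 12 h step
windows = list(walk_forward_windows(300, 175, 24, 12))
gapped = list(walk_forward_windows(300, 175, 24, 12, gap_rows=24))
print(len(windows), windows[0], windows[-1])
print(len(gapped), gapped[0])
```

Each window trains only on rows that precede its test rows, so every score is out-of-sample in time.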

General Hierarchy of Hyperparameter Importance (often, but context matters):

1. eta (learning_rate) & n_estimators: These are highly impactful and strongly linked. Lower eta generally requires higher n_estimators for convergence. Finding a good balance is crucial for performance.
2. max_depth: Controls the complexity of individual trees. Very important for preventing overfitting/underfitting.
3. subsample & colsample_bytree (or _bylevel, _bynode): Introduce stochasticity, reducing variance and preventing overfitting. Often provide good gains.
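4. min_child_weight: Sets the minimum sum of instance weights (Hessians) a child node must hold before a split is allowed. It acts as a regularizer and is particularly useful on imbalanced datasets, where it keeps trees from isolating single outlier instances of the minority class.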
5. gamma (min_split_loss): Another regularization parameter, controlling whether a split is made based on loss reduction.
6. Regularization (lambda, alpha): L2 and L1 regularization help prevent overfitting by shrinking weights. They are important, but often fine-tuning them yields smaller gains compared to getting the core structure (max_depth, eta/n_estimators) right, unless you have severe overfitting issues.
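
The link between eta and n_estimators in item 1 can be made concrete with a toy model. The sketch below assumes an idealized booster in which every round fits the current residual exactly and adds eta times that fit, so the residual shrinks by a factor of (1 - eta) per round. Real boosting on noisy data will not behave this cleanly; `rounds_to_converge` is a hypothetical helper, not an XGBoost function.

```python
import math


def rounds_to_converge(eta, tol=0.01):
    """Rounds needed until the toy residual falls below `tol` of its starting size."""
    if not 0 < eta < 1:
        raise ValueError("eta must be in (0, 1) for this toy model")
    # Residual after n rounds is (1 - eta) ** n, so solve (1 - eta) ** n <= tol.
    return math.ceil(math.log(tol) / math.log(1 - eta))


for eta in (0.3, 0.1, 0.05):
    print(f"eta={eta}: about {rounds_to_converge(eta)} rounds")
```

Halving eta roughly doubles the rounds required in this toy setting, which is why the two parameters are usually tuned together.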

Here are typical considerations for fixed values:

1. gamma (min_split_loss):
Purpose: Controls the minimum loss reduction needed to make a split. Higher values lead to fewer splits (more conservative).
Default: 0 (allows any split that reduces loss, even slightly).
Typical Fixed Value: 0. Sticking with the default is very common if you are not tuning it. It allows the tree to grow based purely on loss improvement.
Slightly More Conservative Fixed Value: Sometimes people might use a very small value like 0.1 or 0.2 if they want to introduce a tiny barrier against splitting on minimal gains, but usually, it's kept at 0 unless tuned.
Range Idea: Think 0 to ~1 as the area of most impact; values much larger quickly prune the tree heavily.

2. lambda (reg_lambda / L2 Regularization):
Purpose: Penalizes large leaf weights using the square of their magnitude. Encourages smaller, smoother leaf values, which helps prevent overfitting when individual leaves would otherwise fit a handful of noisy samples.
Default: 1. This already provides a moderate amount of L2 regularization.
Typical Fixed Value: 1. The default is a very sensible starting point and often used as the fixed value.
Alternative Moderate Fixed Values: Values between 0.5 and 2 are generally considered moderate L2 regularization. Fixing it at 1.5 instead of 1 does not drastically change the regularization strength compared to the default.
Range Idea: 0 means no L2. Values from 0.1 to 10 cover a wide range from weak to strong regularization.

3. alpha (reg_alpha / L1 Regularization):
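Purpose: Penalizes large leaf weights using their absolute magnitude. Can drive leaf weights to exactly zero, which encourages sparsity; with the linear booster (gblinear) the same penalty acts on feature coefficients and performs implicit feature selection.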
Default: 0 (no L1 regularization).
Typical Fixed Value: 0. If you don't have a specific reason to encourage sparsity or perform L1 feature selection, leaving it at the default is standard practice.
Slightly More Conservative Fixed Value: If you suspect some L1 might help a little without tuning, a very small value like 0.01 or 0.1 might be chosen. But often, if L1 is desired, it's added to the tuning grid.
Range Idea: Similar to lambda, 0 means no L1, and values from 0.1 to 10 cover weak to strong L1 effects.
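
The three parameters above enter the tree-building math in different places. The sketch below follows the textbook formulas from the XGBoost paper for a single node with gradient sum G and Hessian sum H. The function names are illustrative, the gain formula assumes alpha = 0 for simplicity, and the library's internal scaling may differ from this form.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value: alpha soft-thresholds G, lambda inflates the denominator."""
    shrunk = max(abs(grad_sum) - reg_alpha, 0.0)   # L1: pull |G| toward zero
    sign = 1.0 if grad_sum > 0 else -1.0
    return -sign * shrunk / (hess_sum + reg_lambda)  # L2: larger lambda, smaller weight


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus gamma; the split is worth keeping only if > 0."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                 - score(g_left + g_right, h_left + h_right))
    return raw - gamma


print(leaf_weight(-0.8, 4.0))                     # default lambda, no L1
print(leaf_weight(-0.8, 4.0, reg_lambda=10.0))    # stronger L2 shrinks the weight
print(leaf_weight(-0.8, 4.0, reg_alpha=1.0))      # alpha >= |G| zeroes the leaf
print(split_gain(-2.0, 4.0, 1.5, 3.0, gamma=0.0)) # positive: split kept
print(split_gain(-2.0, 4.0, 1.5, 3.0, gamma=1.0)) # negative: split pruned
```

In this form, gamma is a flat fee charged per split, lambda damps every leaf, and alpha removes leaves whose gradient signal is too weak.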

Summary for Fixed Values (when not tuning):

Most Common / Standard Defaults:

gamma: 0
lambda: 1
alpha: 0

Slightly Modified Defaults (if desired):

gamma: 0 or 0.1
lambda: 1 or 1.5
alpha: 0 or 0.1
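
One way to apply these fixed values is to merge them into every candidate drawn from a small tuning grid. The sketch below uses only the standard library; `FIXED`, `GRID`, and `expand` are hypothetical names, and the grid values are examples rather than recommendations.

```python
from itertools import product

FIXED = {"gamma": 0, "lambda": 1, "alpha": 0}        # held constant, not tuned
GRID = {"max_depth": [3, 4], "eta": [0.05, 0.1]}     # small grid for the tuned parameters


def expand(fixed, grid):
    """Yield one full parameter dict per grid combination, with fixed values merged in."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**fixed, **dict(zip(keys, values))}


candidates = list(expand(FIXED, GRID))
for params in candidates:
    print(params)
```

scikit-learn's `ParameterGrid` produces the same combinations for the tuned part; the merge step is what keeps the regularization terms constant across candidates.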

In [None]:
# Combined Script: Load CSV -> Feature Engineering -> Rolling Origin XGB Modeling

import pandas as pd
import numpy as np
import time
import os
import warnings
from datetime import datetime
import traceback

# Feature Engineering Imports
import pandas_ta as ta  # Technical indicators

# Modeling Imports
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss
from sklearn.model_selection import ParameterGrid
from sklearn.exceptions import UndefinedMetricWarning

# --- Suppress Warnings ---
warnings.filterwarnings('ignore', category=UndefinedMetricWarning)
warnings.filterwarnings('ignore', category=FutureWarning, module='xgboost')
warnings.filterwarnings('ignore', category=UserWarning, module='xgboost.sklearn')
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning) # Ignore pandas performance warnings
warnings.filterwarnings('ignore') # General suppression for others if needed

# --- Configuration ---

# Data Loading
CSV_FILE_PATH = r'C:\Users\mason\AVP\BTCUSD.csv' # Use raw string for Windows paths
SYMBOL_NAME = 'BTCUSD' # Define the symbol represented in the CSV

# Feature Engineering (No longer needed: REQUIRED_OVERLAP_HOURS)

# Modeling & Walk-Forward
TARGET_THRESHOLD_PCT = 0.1 # Target threshold percentage variable

# Walk-forward params (Assuming HOURLY data in CSV)
TRAIN_WINDOW_HOURS = 175 # Example: ~1 week training
TEST_WINDOW_HOURS = 24   # Example: Predict next 24 hours
STEP_HOURS = 12          # Example: Step forward 12 hours each iteration

# Convert time durations to rows (1 row = 1 hour)
TRAIN_WINDOW_ROWS = TRAIN_WINDOW_HOURS
TEST_WINDOW_ROWS = TEST_WINDOW_HOURS
STEP_ROWS = STEP_HOURS

# Inner Cross-Validation Grid Search Configuration (Keep small)
INNER_CV_PARAM_GRID = {
    'max_depth': [3, 4],
    'n_estimators': [100, 200],
    'eta': [0.05, 0.1],
}
INNER_CV_VALIDATION_PCT = 0.20 # Use last 20% of training data for quick validation

# Fixed XGBoost parameters
XGB_FIXED_PARAMS = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'colsample_bytree': 0.8,
    'min_child_weight': 2,
    'subsample': 0.8,
    'gamma': 0.05,
    'alpha': 0.3,
    'lambda': 1.7, # L2 regularization
    'random_state': 42,
    'n_jobs': -1,
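    # Uncomment for GPU if available and configured.
    # XGBoost >= 2.0 deprecates 'gpu_hist' and 'predictor' in favour of
    # 'tree_method': 'hist' together with 'device': 'cuda'.
    # 'tree_method': 'gpu_hist',
    # 'predictor': 'gpu_predictor'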
}

# --- Feature Engineering Function (Adapted from Script 1) ---

def calculate_advanced_features(df, symbol):
    """Calculate all the advanced features from OHLCV data"""
    print("Starting feature calculation...")
    if df is None or len(df) < 3:
        print("Input DataFrame is too small or None.")
        return pd.DataFrame()

    df = df.copy()
    df['symbol'] = symbol # Add symbol column needed by some logic/output structure

    # Ensure timestamp is datetime and set as index temporarily
    if 'timestamp' not in df.columns:
        print("Error: 'timestamp' column not found.")
        return pd.DataFrame()
    try:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
    except Exception as e:
        print(f"Error converting timestamp column to datetime: {e}")
        return pd.DataFrame()

    df = df.sort_values('timestamp')
    if df['timestamp'].isnull().any():
         print("Warning: NaN values found in timestamp after conversion. Dropping rows.")
         df = df.dropna(subset=['timestamp'])

    df = df.set_index('timestamp', drop=False)

    # Rename volume columns if they exist (handle potential missing cols)
    if 'volumefrom' in df.columns:
         df['volume_btc'] = df['volumefrom']
    elif 'Volume BTC' in df.columns:
         df['volume_btc'] = df['Volume BTC']
    else:
         print("Warning: 'volumefrom' or 'Volume BTC' column not found. Volume features may fail.")
         df['volume_btc'] = 0 # Add dummy column

    if 'volumeto' in df.columns:
         df['volume_usd'] = df['volumeto']
    elif 'Volume USD' in df.columns:
         df['volume_usd'] = df['Volume USD']
    else:
         print("Warning: 'volumeto' or 'Volume USD' column not found.")
         df['volume_usd'] = 0 # Add dummy column

    # Basic Checks
    required_cols = ['open', 'high', 'low', 'close', 'volume_btc']
    for col in required_cols:
        if col not in df.columns:
            print(f"Error: Required column '{col}' not found in DataFrame.")
            return pd.DataFrame()
        # Ensure numeric types
        df[col] = pd.to_numeric(df[col], errors='coerce')

    if df[required_cols].isnull().any().any():
        print(f"Warning: NaNs found in required OHLCV columns. Rows before drop: {len(df)}")
        df = df.dropna(subset=required_cols)
        print(f"Rows after dropping NaNs in OHLCV: {len(df)}")
        if df.empty: return pd.DataFrame()

    # Time-Based Features
    hour_of_day = df.index.hour
    day_of_week = df.index.dayofweek
    for hour in range(24):
        df[f'hour_{hour}'] = (hour_of_day == hour).astype(int)
    for day in range(7):
        df[f'day_{day}'] = (day_of_week == day).astype(int)

    # Basic Price & Volume Changes
    df['price_return_1h'] = df['close'].pct_change()
    df['oc_change_pct'] = (df['close'] - df['open']) / df['open']
    df['price_range_pct'] = (df['high'] - df['low']) / df['low']
    df['volume_return_1h'] = df['volume_btc'].pct_change()

    # Moving Averages and Rolling Std (min_periods set to 2 or full window as needed)
    min_periods_base = 2
    for hours in [3, 6, 12, 24, 48, 72, 168]:
        if len(df) >= hours:
            df[f'ma_{hours}h'] = df['close'].rolling(window=hours, min_periods=min_periods_base).mean()
            df[f'rolling_std_{hours}h'] = df['close'].rolling(window=hours, min_periods=min_periods_base).std()
        else:
            df[f'ma_{hours}h'] = np.nan
            df[f'rolling_std_{hours}h'] = np.nan


    # Lagged Price Returns
    for hours in [3, 6, 12, 24, 48, 72, 168]:
        df[f'lag_{hours}h_price_return'] = df['close'].pct_change(periods=hours)
    # Lagged Volume Returns
    for hours in [3, 6, 12, 24]:
        df[f'lag_{hours}h_volume_return'] = df['volume_btc'].pct_change(periods=hours)

    # Volatility Measures
    log_hl = np.log(df['high'].astype(float) / df['low'].astype(float))**2
    log_co = np.log(df['close'].astype(float) / df['open'].astype(float))**2
    df['garman_klass_12h'] = np.sqrt(0.5 * log_hl.rolling(12, min_periods=min(12, len(df))).sum() - (2*np.log(2)-1) * log_co.rolling(12, min_periods=min(12, len(df))).sum()) if len(df) >= 12 else np.nan
    df['parkinson_3h'] = np.sqrt((1 / (4 * np.log(2))) * (np.log(df['high'].astype(float) / df['low'].astype(float))**2).rolling(3, min_periods=min(3, len(df))).sum()) if len(df) >= 3 else np.nan
    df['tr'] = np.maximum(df['high'] - df['low'],
                          np.maximum(abs(df['high'] - df['close'].shift(1)), abs(df['low'] - df['close'].shift(1))))
    for hours in [14, 24, 48]:
         if len(df) >= hours:
              df[f'atr_{hours}h'] = df['tr'].rolling(hours, min_periods=hours).mean()
         else:
              df[f'atr_{hours}h'] = np.nan

    # Momentum Indicators
    if len(df) >= 14:
        delta = df['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(window=14, min_periods=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14, min_periods=14).mean()
        # Handle potential division by zero or NaN in loss
        with np.errstate(divide='ignore', invalid='ignore'):
            rs = gain / loss
            df['rsi_14h'] = (100 - (100 / (1 + rs)))
        # Replace inf resulting from loss being 0 (strong uptrend) with 100
        df['rsi_14h'] = df['rsi_14h'].replace([np.inf], 100)
        # Fill initial NaNs or NaNs from zero loss with 50 (neutral)
        df['rsi_14h'] = df['rsi_14h'].fillna(50)
    else:
        df['rsi_14h'] = 50.0 # Assign neutral if not enough data

    # MACD requires sufficient data points for EMAs
    min_ema_period = 26
    if len(df) >= min_ema_period:
        ema_12 = df['close'].ewm(span=12, adjust=False, min_periods=12).mean()
        ema_26 = df['close'].ewm(span=26, adjust=False, min_periods=26).mean()
        df['macd'] = ema_12 - ema_26
        df['macd_signal'] = df['macd'].ewm(span=9, adjust=False, min_periods=9).mean()
        df['macd_hist'] = df['macd'] - df['macd_signal']
    else:
        df['macd'] = np.nan
        df['macd_signal'] = np.nan
        df['macd_hist'] = np.nan

    # Volume-Based Indicators
    for hours in [12, 24, 72, 168]:
        if len(df) >= hours:
            df[f'volume_ma_{hours}h'] = df['volume_btc'].rolling(window=hours, min_periods=min_periods_base).mean()
        else:
            df[f'volume_ma_{hours}h'] = np.nan

    df['volume_div_ma_24h'] = df['volume_btc'] / df['volume_ma_24h'] # Will produce NaN/inf if volume_ma_24h is 0 or NaN

    obv = (np.sign(df['close'].diff()) * df['volume_btc']).fillna(0).cumsum()
    df['obv'] = obv
    if len(df) >= 24:
        df['obv_ma_24h'] = df['obv'].rolling(window=24, min_periods=min_periods_base).mean()
    else:
        df['obv_ma_24h'] = np.nan

    # Ratio Features
    df['close_div_ma_24h'] = df['close'] / df['ma_24h']
    df['close_div_ma_48h'] = df['close'] / df['ma_48h']
    df['close_div_ma_168h'] = df['close'] / df['ma_168h']
    df['ma12_div_ma48'] = df['ma_12h'] / df['ma_48h']
    df['ma24_div_ma168'] = df['ma_24h'] / df['ma_168h']
    df['std12_div_std72'] = df['rolling_std_12h'] / df['rolling_std_72h']
    df['atr_14_div_atr_48'] = df['atr_14h'] / df['atr_48h']

    # Normalized Price Position
    if len(df) >= 14:
        low_14 = df['low'].rolling(window=14, min_periods=14).min()
        high_14 = df['high'].rolling(window=14, min_periods=14).max()
        # Handle division by zero if high == low
        range_14 = high_14 - low_14
        df['stoch_k'] = (100 * (df['close'] - low_14) / range_14).replace(np.inf, 50).fillna(50) # Replace inf, fill NaN
        if len(df) >= 16: # 14-row stoch_k window + 2 extra rows for the 3-period mean
             df['stoch_d'] = df['stoch_k'].rolling(window=3, min_periods=3).mean()
        else:
             df['stoch_d'] = np.nan
    else:
        df['stoch_k'] = 50.0
        df['stoch_d'] = np.nan

    df['z_score_24h'] = (df['close'] - df['ma_24h']) / df['rolling_std_24h'] # Handle potential division by zero later

    # Higher-Order Statistics
    if len(df) >= 24:
        df['rolling_skew_24h'] = df['price_return_1h'].rolling(window=24, min_periods=24).skew()
        df['rolling_kurt_24h'] = df['price_return_1h'].rolling(window=24, min_periods=24).kurt()
    else:
        df['rolling_skew_24h'] = np.nan
        df['rolling_kurt_24h'] = np.nan


    # Interaction Features and Non-linear Transformations
    df['volume_btc_x_range'] = df['volume_btc'] * df['price_range_pct']
    df['rolling_std_3h_sq'] = df['rolling_std_3h'] ** 2
    df['price_return_1h_sq'] = df['price_return_1h'] ** 2
    df['rolling_std_12h_sqrt'] = np.sqrt(df['rolling_std_12h'].abs()) # Use abs() before sqrt for safety

    # --- Advanced Features using pandas_ta ---
    # Create a temporary df for ta functions that might expect specific column names
    ta_df = df.rename(columns={'volume_btc': 'volume'}, errors='ignore') # Ignore error if 'volume_btc' doesn't exist

    # Check for required columns before calling ta functions
    if all(c in ta_df.columns for c in ['high', 'low', 'close']):
        # ADX/DMI
        adx_df = ta_df.ta.adx(length=14)
        if adx_df is not None and not adx_df.empty:
             df['adx_14h'] = adx_df.get('ADX_14', np.nan)
             df['plus_di_14h'] = adx_df.get('DMP_14', np.nan)
             df['minus_di_14h'] = adx_df.get('DMN_14', np.nan)
        else:
             df['adx_14h'], df['plus_di_14h'], df['minus_di_14h'] = np.nan, np.nan, np.nan

        # Accumulation/Distribution Line (requires volume)
        if 'volume' in ta_df.columns:
             df['ad_line'] = ta_df.ta.ad()
        else: df['ad_line'] = np.nan

        # Chaikin Money Flow (requires volume)
        if 'volume' in ta_df.columns:
             df['cmf_20h'] = ta_df.ta.cmf(length=20)
        else: df['cmf_20h'] = np.nan

        # Commodity Channel Index
        df['cci_20h'] = ta_df.ta.cci(length=20)

        # Bollinger Bands
        bbands_df = ta_df.ta.bbands(length=20, std=2)
        if bbands_df is not None and not bbands_df.empty:
            df['bband_width_20h'] = bbands_df.get('BBB_20_2.0', np.nan) # Bandwidth
            df['bband_pctb_20h'] = bbands_df.get('BBP_20_2.0', np.nan) # %B
        else:
             df['bband_width_20h'], df['bband_pctb_20h'] = np.nan, np.nan
    else:
        print("Warning: Missing high, low, or close columns. Skipping some pandas_ta features.")
        # Add NaN columns explicitly if base cols are missing
        df['adx_14h'], df['plus_di_14h'], df['minus_di_14h'] = np.nan, np.nan, np.nan
        df['ad_line'], df['cmf_20h'], df['cci_20h'] = np.nan, np.nan, np.nan
        df['bband_width_20h'], df['bband_pctb_20h'] = np.nan, np.nan


    # Position in Range (handle division by zero)
    range_hl = df['high'] - df['low']
    df['close_pos_in_range'] = ((df['close'] - df['low']) / range_hl).fillna(0.5) # fillna for 0 range
    df['close_pos_in_range'] = df['close_pos_in_range'].replace([np.inf, -np.inf], 0.5) # Replace inf/-inf just in case

    # --- Additional New Features ---
    # Log Returns
    df['log_return_1h'] = np.log(df['close'] / df['close'].shift(1))

    # VWAP (24h) - ensure proper volume sum and handle potential division by zero
    if 'volume_btc' in df.columns and len(df) >= 24:
        numerator = (df['close'] * df['volume_btc']).rolling(window=24, min_periods=24).sum()
        denominator = df['volume_btc'].rolling(window=24, min_periods=24).sum()
        # Handle division by zero or NaN denominator
        with np.errstate(divide='ignore', invalid='ignore'):
            df['vwap_24h'] = numerator / denominator
        df['vwap_24h'] = df['vwap_24h'].ffill() # Optional: forward fill VWAP NaNs (fillna(method=...) is deprecated in recent pandas)
    else:
        df['vwap_24h'] = np.nan

    # Stochastic RSI using pandas_ta
    if 'close' in ta_df.columns:
        stochrsi_df = ta_df.ta.stochrsi(length=14)
        if stochrsi_df is not None and not stochrsi_df.empty:
            # Pandas-ta column names can vary slightly or include parameters
            k_col = next((col for col in stochrsi_df.columns if 'STOCHRSIk' in col), None)
            d_col = next((col for col in stochrsi_df.columns if 'STOCHRSId' in col), None)
            df['stoch_rsi_k'] = stochrsi_df[k_col] if k_col else np.nan
            df['stoch_rsi_d'] = stochrsi_df[d_col] if d_col else np.nan
        else:
            df['stoch_rsi_k'], df['stoch_rsi_d'] = np.nan, np.nan
    else:
        df['stoch_rsi_k'], df['stoch_rsi_d'] = np.nan, np.nan


    # Cleanup: reset index and replace infinities
    df = df.reset_index(drop=True) # Reset index here
    df = df.replace([np.inf, -np.inf], np.nan) # Replace any new infs

    # Define the list of ALL potential feature columns (plus identifiers/OHLCV)
    # Make sure these match the column names generated above
    all_columns = [
        'timestamp', 'symbol', 'open', 'high', 'low', 'close', 'volume_btc', 'volume_usd',
        'garman_klass_12h', 'price_range_pct', 'oc_change_pct', 'parkinson_3h',
        'ma_3h', 'rolling_std_3h', 'lag_3h_price_return', 'lag_6h_price_return',
        'lag_12h_price_return', 'lag_24h_price_return', 'lag_48h_price_return',
        'lag_72h_price_return', 'lag_168h_price_return', 'volume_return_1h',
        'lag_3h_volume_return', 'lag_6h_volume_return', 'lag_12h_volume_return',
        'lag_24h_volume_return', 'ma_6h', 'ma_12h', 'ma_24h', 'ma_48h', 'ma_72h',
        'ma_168h', 'rolling_std_6h', 'rolling_std_12h', 'rolling_std_24h',
        'rolling_std_48h', 'rolling_std_72h', 'rolling_std_168h', 'atr_14h',
        'atr_24h', 'atr_48h', 'close_div_ma_24h', 'close_div_ma_48h',
        'close_div_ma_168h', 'ma12_div_ma48', 'ma24_div_ma168', 'std12_div_std72',
        'volume_btc_x_range', 'rolling_std_3h_sq', 'price_return_1h_sq',
        'rolling_std_12h_sqrt',
        'hour_0', 'hour_1', 'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6',
        'hour_7', 'hour_8', 'hour_9', 'hour_10', 'hour_11', 'hour_12',
        'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18',
        'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23',
        'day_0', 'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
        'rsi_14h', 'macd', 'macd_signal', 'macd_hist', 'volume_ma_12h',
        'volume_ma_24h', 'volume_ma_72h', 'volume_ma_168h', 'volume_div_ma_24h',
        'obv', 'obv_ma_24h', 'atr_14_div_atr_48', 'stoch_k', 'stoch_d',
        'z_score_24h', 'rolling_skew_24h', 'rolling_kurt_24h',
        'adx_14h', 'plus_di_14h', 'minus_di_14h', 'ad_line', 'cmf_20h', 'cci_20h',
        'bband_width_20h', 'bband_pctb_20h', 'close_pos_in_range',
        'vwap_24h', 'log_return_1h', 'stoch_rsi_k', 'stoch_rsi_d'
        # Columns removed: 'price_return_1h' (used intermediate, log_return kept)
    ]

    # Keep only the defined columns that were actually calculated; missing ones are skipped
    final_cols_present = []
    for col in all_columns:
        if col in df.columns:
            final_cols_present.append(col)
        # else: # Don't add missing columns, just use what was successfully calculated
        #     print(f"Debug: Column '{col}' not found after calculation, will be excluded.")
        #     # df[col] = np.nan # Optionally add if strict schema is needed

    print(f"Feature calculation finished. Returning {len(df)} rows with {len(final_cols_present)} columns.")
    return df[final_cols_present]


# --- Main Execution Block ---
if __name__ == "__main__":

    print("--- 1. Data Loading from CSV ---")
    try:
        print(f"Loading data from: {CSV_FILE_PATH}")
        # Specify column names based on the CSV header provided
        col_names = ['unix', 'date', 'symbol_csv', 'open', 'high', 'low', 'close', 'Volume BTC', 'Volume USD']
        df_raw = pd.read_csv(CSV_FILE_PATH, header=0, names=col_names) # Read header, but assign names for clarity
        print(f"Raw data loaded. Shape: {df_raw.shape}")

        # --- Basic Preprocessing ---
        # Use 'date' as the timestamp
        df_raw['timestamp'] = pd.to_datetime(df_raw['date'])
        # Drop redundant columns
        df_raw = df_raw.drop(['unix', 'date', 'symbol_csv'], axis=1)
        # Rename volume columns for the feature function
        # The feature function handles renaming these -> volume_btc/usd
        # df_raw = df_raw.rename(columns={'Volume BTC': 'volumefrom', 'Volume USD': 'volumeto'}) # Done inside function now

        # Sort by timestamp initially
        df_raw = df_raw.sort_values('timestamp').reset_index(drop=True)

        print(f"Columns after initial processing: {df_raw.columns.tolist()}")
        if not df_raw.empty:
             print(f"Time range: {df_raw['timestamp'].min()} to {df_raw['timestamp'].max()}")
        else:
             print("DataFrame is empty after loading/initial processing.")
             exit()

    except FileNotFoundError:
        print(f"Error: CSV file not found at {CSV_FILE_PATH}")
        exit()
    except Exception as e:
        print(f"Error loading or processing CSV: {e}")
        traceback.print_exc()
        exit()

    # --- 2. Feature Engineering ---
    print("\n--- 2. Feature Engineering ---")
    feature_calc_start = time.time()
    # Pass the specific symbol name
    df_features = calculate_advanced_features(df_raw, symbol=SYMBOL_NAME)
    feature_calc_end = time.time()

    if df_features.empty:
        print("Feature calculation failed or resulted in empty DataFrame. Exiting.")
        exit()
    print(f"Feature calculation completed in {feature_calc_end - feature_calc_start:.2f} seconds.")
    print(f"Shape after feature calculation: {df_features.shape}")

    # --- 3. Data Cleaning (Post-Features) ---
    print("\n--- 3. Data Cleaning (Features) ---")
    numeric_columns = df_features.select_dtypes(include=np.number).columns
    missing_before = df_features[numeric_columns].isna().sum().sum()
    infinite_before = np.isinf(df_features[numeric_columns].values).sum() # Count infs before they are replaced with NaN
    print(f"Missing/Infinite values before cleaning: {missing_before} / {infinite_before}")

    # Replace inf/-inf with NaN first, then drop rows with NaN in numeric columns
    df_features = df_features.replace([np.inf, -np.inf], np.nan)
    # Optional: decide which columns are critical for dropping NaNs
    critical_features = ['close'] + [col for col in df_features.columns if 'ma_' in col or 'rolling_std_' in col or 'return' in col] # Example critical cols
    critical_features = [col for col in critical_features if col in df_features.columns] # Ensure they exist
    rows_before_drop = len(df_features)
    df = df_features.dropna(subset=critical_features).copy() # Drop rows missing critical features; copy so later column assignments are safe

    missing_after = df[numeric_columns].isna().sum().sum()
    infinite_after = np.isinf(df[numeric_columns].values).sum() # Should be 0 now
    print(f"Missing/Infinite values after cleaning: {missing_after} / {infinite_after}")
    print(f"Rows removed due to NaNs in critical features: {rows_before_drop - len(df)} ({((rows_before_drop - len(df)) / rows_before_drop * 100):.2f}%)")

    if df.empty:
        print("DataFrame empty after cleaning NaN features. Exiting.")
        exit()
    print(f"Shape after cleaning: {df.shape}")


    # --- 4. Modeling Target & Feature Preparation ---
    print("\n--- 4. Modeling Target & Feature Preparation ---")

    # Create dummy variables for symbols - Assuming only BTCUSD is present
    print(f"Creating dummy variables for symbol '{SYMBOL_NAME}'...")
    df['BTC'] = 1 # Since the CSV is BTCUSD, BTC is always 1
    df['SOL'] = 0 # Add SOL=0 for consistency if model expects it
    # Note: If features like 'adx_14h' etc., were calculated *per symbol* originally,
    # this approach assumes the calculation is valid for the single symbol BTCUSD.

    # Create binary target
    print(f"Creating binary target 'target' ({TEST_WINDOW_HOURS}-hour return >= {TARGET_THRESHOLD_PCT}%)...")
    # Sort for reliable shift (should already be sorted, but safety check)
    df = df.sort_values(['symbol', 'timestamp']) # Sort by symbol (even if only 1) and time
    df['future_price'] = df.groupby('symbol')['close'].shift(-TEST_WINDOW_ROWS) # Shift by test window size (in hours/rows)
    df['price_return_future'] = (df['future_price'] - df['close']) / df['close'] * 100

    # NaN >= threshold evaluates to False, so rows without a future price would be
    # silently labelled 0. Drop them BEFORE building the target.
    print(f"Rows without a future price before drop: {df['price_return_future'].isna().sum()} (Expected due to shift)")
    df = df.dropna(subset=['price_return_future']) # Drop rows where target cannot be calculated (end of series)
    df['target'] = (df['price_return_future'] >= TARGET_THRESHOLD_PCT).astype(int)
    df = df.drop(['future_price', 'price_return_future'], axis=1)
    print(f"Rows after removing NaN targets: {len(df)}")
    if df.empty:
        print("DataFrame empty after target creation/NaN drop. Exiting.")
        exit()

    # Display target distribution
    target_counts = df['target'].value_counts(normalize=True) * 100
    print("\nTarget variable distribution:")
    print(f"  0 (< {TARGET_THRESHOLD_PCT}% return): {target_counts.get(0, 0):.2f}%")
    print(f"  1 (>= {TARGET_THRESHOLD_PCT}% return): {target_counts.get(1, 0):.2f}%")

    # Final Preparation for Walk-Forward
    print("\n--- Final Preparation for Walk-Forward ---")
    df = df.sort_values('timestamp').reset_index(drop=True) # Ensure chronological order and reset index for iloc
    print(f"Final DataFrame shape for backtesting: {df.shape}")

    # Define Feature Columns for Model Training
    TARGET_COLUMN = 'target'
    # Exclude identifiers, future-leaking info (if any added), and the target itself
    EXCLUDE_COLS = [TARGET_COLUMN, 'symbol', 'timestamp', 'open', 'high', 'low', 'close', 'volume_btc', 'volume_usd'] # Also exclude basic OHLCV? Optional.
    # FEATURE_COLUMNS = [col for col in df.columns if col not in EXCLUDE_COLS and df[col].dtype in [np.int64, np.float64, np.int32, np.float32]]
    # Let's be more explicit based on the generated list + dummies
    potential_features = [
        'garman_klass_12h', 'price_range_pct', 'oc_change_pct', 'parkinson_3h',
        'ma_3h', 'rolling_std_3h', 'lag_3h_price_return', 'lag_6h_price_return',
        'lag_12h_price_return', 'lag_24h_price_return', 'lag_48h_price_return',
        'lag_72h_price_return', 'lag_168h_price_return', 'volume_return_1h',
        'lag_3h_volume_return', 'lag_6h_volume_return', 'lag_12h_volume_return',
        'lag_24h_volume_return', 'ma_6h', 'ma_12h', 'ma_24h', 'ma_48h', 'ma_72h',
        'ma_168h', 'rolling_std_6h', 'rolling_std_12h', 'rolling_std_24h',
        'rolling_std_48h', 'rolling_std_72h', 'rolling_std_168h', 'atr_14h',
        'atr_24h', 'atr_48h', 'close_div_ma_24h', 'close_div_ma_48h',
        'close_div_ma_168h', 'ma12_div_ma48', 'ma24_div_ma168', 'std12_div_std72',
        'volume_btc_x_range', 'rolling_std_3h_sq', 'price_return_1h_sq', # price_return_1h_sq might leak if not careful
        'rolling_std_12h_sqrt',
        'hour_0', 'hour_1', 'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6',
        'hour_7', 'hour_8', 'hour_9', 'hour_10', 'hour_11', 'hour_12',
        'hour_13', 'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18',
        'hour_19', 'hour_20', 'hour_21', 'hour_22', 'hour_23',
        'day_0', 'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
        'rsi_14h', 'macd', 'macd_signal', 'macd_hist', 'volume_ma_12h',
        'volume_ma_24h', 'volume_ma_72h', 'volume_ma_168h', 'volume_div_ma_24h',
        'obv', 'obv_ma_24h', 'atr_14_div_atr_48', 'stoch_k', 'stoch_d',
        'z_score_24h', 'rolling_skew_24h', 'rolling_kurt_24h',
        'adx_14h', 'plus_di_14h', 'minus_di_14h', 'ad_line', 'cmf_20h', 'cci_20h',
        'bband_width_20h', 'bband_pctb_20h', 'close_pos_in_range',
        'vwap_24h', 'log_return_1h', 'stoch_rsi_k', 'stoch_rsi_d',
        'BTC', 'SOL' # Include dummy variables
    ]
    FEATURE_COLUMNS = [col for col in potential_features if col in df.columns] # Select only those actually present

    # Final check for NaNs/Infs in selected feature columns before modeling
    check_nan_inf = df[FEATURE_COLUMNS].isnull().sum().sum() + np.isinf(df[FEATURE_COLUMNS].values).sum()
    if check_nan_inf > 0:
        print(f"WARNING: {check_nan_inf} NaN/Infinite values detected in FEATURE_COLUMNS just before modeling loop. This might cause errors.")
        # Consider df[FEATURE_COLUMNS] = df[FEATURE_COLUMNS].fillna(df[FEATURE_COLUMNS].median()) # Example imputation
        # Or drop again: df = df.dropna(subset=FEATURE_COLUMNS)
        print("Attempting to fill remaining NaNs with column medians...")
        for col in FEATURE_COLUMNS:
             if df[col].isnull().any():
                  median_val = df[col].median()
                  df[col] = df[col].fillna(median_val)
                  print(f"  Filled NaNs in {col} with median {median_val}")
        check_nan_inf_after = df[FEATURE_COLUMNS].isnull().sum().sum() + np.isinf(df[FEATURE_COLUMNS].values).sum()
        if check_nan_inf_after > 0:
            print("ERROR: NaN/Inf values remain even after fillna. Exiting.")
            exit()
        else:
            print("NaN fill complete.")


    print(f"\n--- Feature Selection ---")
    print(f"Target Column: '{TARGET_COLUMN}'")
    print(f"Number of features selected: {len(FEATURE_COLUMNS)}")
    # print(f"Selected Features: {FEATURE_COLUMNS}") # Uncomment to see list

    # --- 5. Walk-Forward Validation Implementation ---
    print("\n--- 5. Starting Walk-Forward Validation ---")
    all_metrics = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
    iteration_count = 0
    n_rows_total = len(df)
    current_train_start_idx = 0
    total_iterations_estimate = max(0, (n_rows_total - TRAIN_WINDOW_ROWS - TEST_WINDOW_ROWS) // STEP_ROWS +1) # Adjust estimate slightly
    print(f"Total rows: {n_rows_total}, Train Window: {TRAIN_WINDOW_ROWS} ({TRAIN_WINDOW_HOURS}h), Test Window: {TEST_WINDOW_ROWS} ({TEST_WINDOW_HOURS}h), Step: {STEP_ROWS} ({STEP_HOURS}h)")
    print(f"Estimated iterations: {total_iterations_estimate}")
    print(f"Inner CV Grid: {INNER_CV_PARAM_GRID}")
    print("-" * 30)
    start_loop_time = time.time()

    while True:
        # --- Define Window Boundaries ---
        train_end_idx = current_train_start_idx + TRAIN_WINDOW_ROWS
        test_start_idx = train_end_idx
        test_end_idx = test_start_idx + TEST_WINDOW_ROWS

        # --- Boundary Checks ---
        if test_end_idx > n_rows_total:
            print(f"\nStopping: Test window end index ({test_end_idx}) exceeds total rows ({n_rows_total}).")
            break
        if current_train_start_idx >= n_rows_total:
             print(f"\nStopping: Train window start index ({current_train_start_idx}) reached end.")
             break


        # --- Data Slicing ---
        train_df = df.iloc[current_train_start_idx : train_end_idx]
        test_df = df.iloc[test_start_idx : test_end_idx]

        # --- Data Validity Checks ---
        min_train_samples = max(50, int(0.1 * TRAIN_WINDOW_ROWS)) # Need enough samples for train/val split
        min_test_samples = 5
        if len(train_df) < min_train_samples or len(test_df) < min_test_samples:
            print(f"Skipping iter {iteration_count + 1}: Insufficient data train ({len(train_df)}/{min_train_samples}) or test ({len(test_df)}/{min_test_samples}).")
            current_train_start_idx += STEP_ROWS
            continue

        X_train_full = train_df[FEATURE_COLUMNS]
        y_train_full = train_df[TARGET_COLUMN]
        X_test = test_df[FEATURE_COLUMNS]
        y_test = test_df[TARGET_COLUMN]

        # --- Check for Single Class ---
        train_counts = y_train_full.value_counts()
        test_counts = y_test.value_counts()
        if len(train_counts) < 2:
            print(f"Skipping iter {iteration_count + 1}: Full training data has only one class ({train_counts.index.tolist()}). Train indices: {current_train_start_idx}-{train_end_idx-1}")
            current_train_start_idx += STEP_ROWS
            continue
        if len(test_counts) < 2:
            print(f"Warning iter {iteration_count + 1}: Test data has only one class ({test_counts.index.tolist()}). Metrics might be affected. Test indices: {test_start_idx}-{test_end_idx-1}")
            # Allow continuing, but metrics like precision/recall/F1 might be 0 or undefined.

        # --- Calculate scale_pos_weight (based on full training data for this window) ---
        neg_count = train_counts.get(0, 0)
        pos_count = train_counts.get(1, 0)
        scale_pos_weight_val = neg_count / pos_count if pos_count > 0 else 1.0

        # --- Inner Loop: Simple Grid Search for Hyperparameters ---
        iter_start_time = time.time()
        best_val_score = -np.inf # Using F1 score, higher is better
        best_params_iter = None

        # Split training data for inner validation
        val_size = int(len(X_train_full) * INNER_CV_VALIDATION_PCT)
        if val_size < min_test_samples or (len(X_train_full) - val_size) < min_test_samples: # Ensure both train_sub and val are usable
            print(f"Warning iter {iteration_count + 1}: Train subset or Validation set too small (Total Train: {len(X_train_full)}, Val size: {val_size}). Skipping inner CV.")
            best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Use first combo as default
        else:
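            # Chronological split: the most recent val_size rows form the validation set.
            # .iloc makes the slicing explicitly positional, whatever the index type.
            X_train_sub = X_train_full.iloc[:-val_size]
            y_train_sub = y_train_full.iloc[:-val_size]
            X_val = X_train_full.iloc[-val_size:]
            y_val = y_train_full.iloc[-val_size:]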

            # Check if validation sub-set has both classes
            if len(y_val.unique()) < 2 or len(y_train_sub.unique()) < 2:
                print(f"Warning iter {iteration_count + 1}: Inner train ({len(y_train_sub.unique())} classes) or validation set ({len(y_val.unique())} classes) has only one class. Skipping inner CV.")
                best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Use first combo
            else:
                # Iterate through the small grid
                for params_cv in ParameterGrid(INNER_CV_PARAM_GRID):
                    try:
                        current_cv_params = {**XGB_FIXED_PARAMS, **params_cv}
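                        # XGBoost >= 1.6 takes early_stopping_rounds in the constructor; passing it
                        # to fit() was removed in 2.0. On older versions, pass it to fit() instead.
                        # use_label_encoder is deprecated in recent releases and can be dropped there.
                        model_cv = XGBClassifier(**current_cv_params,
                                                 scale_pos_weight=scale_pos_weight_val,
                                                 early_stopping_rounds=10,
                                                 use_label_encoder=False)

                        model_cv.fit(X_train_sub, y_train_sub,
                                     eval_set=[(X_val, y_val)],
                                     verbose=False)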

                        y_pred_val = model_cv.predict(X_val)
                        # Use F1 score on validation set - good for imbalance
                        val_score = f1_score(y_val, y_pred_val, average='binary', pos_label=1, zero_division=0)

                        if val_score > best_val_score:
                            best_val_score = val_score
                            best_params_iter = params_cv

                    except Exception as e_cv:
                        print(f"Error during inner CV iter {iteration_count + 1} with params {params_cv}: {e_cv}")
                        if best_params_iter is None: # If first CV try fails
                            best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0]


        # Handle case where inner CV failed entirely or was skipped
        if best_params_iter is None:
            print(f"Warning iter {iteration_count + 1}: Inner CV did not find best params (or was skipped), using default.")
            best_params_iter = list(ParameterGrid(INNER_CV_PARAM_GRID))[0] # Default to first combo

        # --- Train Final Model for this Iteration ---
        final_iter_params = {**XGB_FIXED_PARAMS, **best_params_iter}

        # Print progress less frequently
        if (iteration_count == 0) or ((iteration_count + 1) % 20 == 0) or (test_end_idx >= n_rows_total): # Print first, every 20, and last
             print(f"\nIter {iteration_count + 1}/{total_iterations_estimate}: "
                   f"Train Indices [{current_train_start_idx}:{train_end_idx-1}], "
                   f"Test Indices [{test_start_idx}:{test_end_idx-1}], "
                   f"Best Params Found: {best_params_iter}")

        current_model = XGBClassifier(**final_iter_params,
                                      scale_pos_weight=scale_pos_weight_val,
                                      use_label_encoder=False)

        try:
            # Train on the FULL training set for this window
            current_model.fit(X_train_full, y_train_full, verbose=False)

            # --- Prediction and Evaluation ---
            y_pred = current_model.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            precision = precision_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)
            recall = recall_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)
            f1 = f1_score(y_test, y_pred, average='binary', pos_label=1, zero_division=0)

            all_metrics['accuracy'].append(accuracy)
            all_metrics['precision'].append(precision)
            all_metrics['recall'].append(recall)
            all_metrics['f1'].append(f1)

            iteration_count += 1 # Increment successful iteration count
            iter_end_time = time.time()
            # Optional: print metrics per iteration
            # print(f"  Metrics: Acc={accuracy:.4f}, Prc={precision:.4f}, Rec={recall:.4f}, F1={f1:.4f} (took {iter_end_time - iter_start_time:.2f}s)")

        except Exception as model_err:
            print(f"ERROR during model training or prediction in Iter {iteration_count + 1}: {model_err}")
            print(f"  Train shape: {X_train_full.shape}, Test shape: {X_test.shape}")
            print(f"  Train classes: {y_train_full.value_counts().to_dict()}, Test classes: {y_test.value_counts().to_dict()}")
            # Optionally add more debug info here, e.g., check for NaNs again
            check_nan_inf_train = X_train_full.isnull().sum().sum() + np.isinf(X_train_full.values).sum()
            check_nan_inf_test = X_test.isnull().sum().sum() + np.isinf(X_test.values).sum()
            print(f"  NaN/Inf check before fit: Train={check_nan_inf_train}, Test={check_nan_inf_test}")
            # Skip this iteration's metrics if model failed
            print("  Skipping metrics for this iteration due to error.")


        # --- Move to Next Window ---
        current_train_start_idx += STEP_ROWS


    end_loop_time = time.time()
    print("-" * 30)
    loop_duration_minutes = (end_loop_time - start_loop_time) / 60
    print(f"Walk-Forward Validation finished in {end_loop_time - start_loop_time:.2f} seconds ({loop_duration_minutes:.2f} minutes).")

    # --- 6. Aggregate and Display Results ---
    print("\n--- 6. Final Results ---")
    if iteration_count > 0 and len(all_metrics['f1']) > 0 : # Check if any iterations successfully completed and added metrics
        avg_accuracy = np.mean(all_metrics['accuracy'])
        avg_precision = np.mean(all_metrics['precision'])
        avg_recall = np.mean(all_metrics['recall'])
        avg_f1 = np.mean(all_metrics['f1'])

        print("\n--- Average Walk-Forward Validation Results (XGBoost with Inner CV) ---")
        print(f"Total Folds / Successful Iterations Evaluated: {iteration_count}")
        print(f"Target Threshold: {TARGET_THRESHOLD_PCT}% increase over {TEST_WINDOW_HOURS} hours")
        print(f"Train Window: {TRAIN_WINDOW_HOURS} hours, Test Window: {TEST_WINDOW_HOURS} hours, Step: {STEP_HOURS} hours")
        print(f"Average Accuracy:  {avg_accuracy:.4f}")
        print(f"Average Precision: {avg_precision:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")
        print(f"Average Recall:    {avg_recall:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")
        print(f"Average F1-Score:  {avg_f1:.4f} (for class 1: return >= {TARGET_THRESHOLD_PCT}%)")

        std_accuracy = np.std(all_metrics['accuracy'])
        std_precision = np.std(all_metrics['precision'])
        std_recall = np.std(all_metrics['recall'])
        std_f1 = np.std(all_metrics['f1'])
        print("\n--- Standard Deviation of Metrics Across Folds ---")
        print(f"Std Dev Accuracy:  {std_accuracy:.4f}")
        print(f"Std Dev Precision: {std_precision:.4f}")
        print(f"Std Dev Recall:    {std_recall:.4f}")
        print(f"Std Dev F1-Score:  {std_f1:.4f}")
    else:
        print("\nNo iterations were successfully completed or no metrics were recorded.")

    print("\nScript finished.")
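
The window arithmetic in the loop above can be pulled out into a small generator, which makes the boundary logic easy to check on its own. This is only a sketch and not part of the original script: the name `walk_forward_windows` and its parameters are hypothetical, standing in for `TRAIN_WINDOW_ROWS`, `TEST_WINDOW_ROWS` and `STEP_ROWS`.

```python
from typing import Iterator, Tuple


def walk_forward_windows(n_rows: int, train_rows: int, test_rows: int,
                         step_rows: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (train_start, train_end, test_start, test_end) as half-open index ranges.

    Hypothetical helper: mirrors the sliding-window loop above, where the test
    window starts exactly where the training window ends.
    """
    if min(train_rows, test_rows, step_rows) <= 0:
        raise ValueError("window sizes and step must be positive")
    train_start = 0
    while True:
        train_end = train_start + train_rows
        test_end = train_end + test_rows
        if test_end > n_rows:  # the test window would run past the data
            break
        yield train_start, train_end, train_end, test_end
        train_start += step_rows


if __name__ == "__main__":
    # Toy sizes, chosen only for illustration.
    for window in walk_forward_windows(n_rows=10, train_rows=4, test_rows=2, step_rows=2):
        print(window)
```

With these toy sizes the generator yields three windows, which matches the estimate formula used in the script: (10 - 4 - 2) // 2 + 1 = 3. Each tuple can be passed straight to `df.iloc[train_start:train_end]` and `df.iloc[test_start:test_end]`.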

--- 1. Data Loading from CSV ---
Loading data from: C:\Users\mason\AVP\BTCUSD.csv
Raw data loaded. Shape: (60403, 9)
Columns after initial processing: ['open', 'high', 'low', 'close', 'Volume BTC', 'Volume USD', 'timestamp']
Time range: 2018-05-15 06:00:00 to 2025-04-05 00:00:00

--- 2. Feature Engineering ---
Starting feature calculation...
Feature calculation finished. Returning 60403 rows with 112 columns.
Feature calculation completed in 0.48 seconds.
Shape after feature calculation: (60403, 112)

--- 3. Data Cleaning (Features) ---
Missing/Infinite values before cleaning: 962 / 0
Missing/Infinite values after cleaning: 0 / 0
Rows removed due to NaNs in critical features: 215 (0.36%)
Shape after cleaning: (60188, 112)

--- 4. Modeling Target & Feature Preparation ---
Creating dummy variables for symbol 'BTCUSD'...
Creating binary target 'target' (24-hour return >= 0.1%)...
NaN values in target before drop: 0 (Expected due to shift)
Rows after removing NaN targets: 60188

Target var

Computational Overhead: This adds a substantial calculation step before every training iteration. With retraining every 5 minutes, each cycle must compute weighted importances across 100 past results, select features, and subset the data before training can start. The extra time and complexity may make it difficult to keep up with the 5-minute interval.
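
Whether this overhead matters in practice depends on the number of features and the size of the data, so it may be worth timing the selection step in isolation. The sketch below is one possible shape for that step, under stated assumptions: the function name `select_top_features`, the exponential recency weighting, the `decay` value, and the synthetic importance history are all hypothetical, since the weighting scheme is not specified here.

```python
import time
from collections import deque

import numpy as np


def select_top_features(importance_history, feature_names, top_k, decay=0.97):
    """Return the top_k feature names by recency-weighted mean importance.

    Assumption: the newest result gets weight 1 and each step back in time is
    multiplied by `decay`. The actual weighting scheme is not given in the text.
    """
    history = np.asarray(list(importance_history), dtype=float)  # (n_results, n_features)
    if history.size == 0:
        raise ValueError("importance_history is empty")
    ages = np.arange(len(history) - 1, -1, -1)   # the oldest row has the largest age
    weights = decay ** ages
    weighted = weights @ history / weights.sum()  # weighted mean per feature
    top_idx = np.argsort(weighted)[::-1][:top_k]
    return [feature_names[i] for i in top_idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature_names = [f"f{i}" for i in range(112)]  # hypothetical feature names
    history = deque(maxlen=100)                    # keeps only the last 100 results
    for _ in range(100):
        history.append(rng.random(len(feature_names)))  # synthetic importances

    start = time.perf_counter()
    selected = select_top_features(history, feature_names, top_k=30)
    elapsed = time.perf_counter() - start
    print(f"Selected {len(selected)} features in {elapsed * 1000:.2f} ms")
```

The timing printed here covers only the weighting and selection. Subsetting the data and retraining on the reduced feature set would need to be measured separately against the 5-minute budget.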