VIF: Stands for Variance Inflation Factor.

Purpose: 
Measures how much the variance of an estimated regression coefficient increases due to multicollinearity (correlation between predictor variables).

Interpretation:

-Quantifies how well one predictor feature can be explained by all other predictor features in the model.

-High VIF: The feature is highly correlated with other features; its information is largely redundant.

-Low VIF: The feature is relatively independent of other features.
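
The link between "explained by the other features" and the VIF number is the standard identity VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. So VIF = 5 corresponds to R^2 = 0.8 and VIF = 10 to R^2 = 0.9. A minimal sketch of that calculation with plain NumPy follows; the helper name `vif_from_r2` and the synthetic columns are illustrative assumptions, not part of the notebook below.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + other features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()                # centered R^2
    return 1.0 / (1.0 - r2)

# Synthetic data (assumption): c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
c = a + 0.1 * rng.normal(size=2000)
X = np.column_stack([a, b, c])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

Here `a` and `c` should come out far above 10 because each is almost fully explained by the other, while `b` should stay close to 1.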


Common Thresholds (Rules of Thumb):

-VIF = 1: No multicollinearity (ideal).

-1 < VIF < 5: Moderate multicollinearity (often acceptable).

-5 <= VIF < 10: High multicollinearity (potentially problematic, common threshold range).

-VIF >= 10: Very high multicollinearity (feature likely redundant, coefficients unstable in linear models).


Feature Selection Use:

-Used iteratively: Calculate VIF for all features, remove the one with the highest VIF if it exceeds a set threshold, repeat until all remaining features are below the threshold.

-Helps create a less redundant feature set, potentially improving model stability and interpretability, and may help mitigate severe overfitting issues.

In [1]:
import pandas as pd
import numpy as np
import time
import os
import warnings
import traceback

# Feature Engineering Imports
import pandas_ta as ta

# Feature Selection Imports
from sklearn.feature_selection import VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.sm_exceptions import PerfectSeparationError # Used in a defensive except clause; the OLS fit behind VIF does not normally raise it

# --- Suppress Warnings ---
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning)
warnings.filterwarnings('ignore') # General suppression

# --- Configuration ---
CSV_FILE_PATH = r'C:\Users\mason\AVP\BTCUSD.csv' # Use raw string for Windows paths
SYMBOL_NAME = 'BTCUSD' # Define the symbol represented in the CSV

# Feature Selection Thresholds
VARIANCE_THRESHOLD = 0.0003  # Remove features with variance <= this value (not only constants; small-scale features such as raw returns also fall below it)
VIF_THRESHOLD = 6.9      # Remove features iteratively with VIF >= this value

# Define columns that are NOT features to be selected (identifiers, raw data, target)
# Make sure this list is comprehensive for your data
NON_FEATURE_COLS = [
    'timestamp', 'symbol', 'open', 'high', 'low', 'close', 'volume_btc', 'volume_usd',
    'target', 'future_price', 'price_return_future', # Include potential target-related cols if they exist
    'unix', 'date', 'symbol_csv', # From original loading if present before dropping
    'volumefrom', 'volumeto', # Intermediate volume names if present
    'tr' # Intermediate calculation like True Range if not dropped inside function
]

# --- Feature Engineering Function ---
def calculate_advanced_features(df, symbol):
    """Calculate all the advanced features from OHLCV data"""
    print("Starting feature calculation...")
    if df is None or len(df) < 3:
        print("Input DataFrame is too small or None.")
        return pd.DataFrame()

    df = df.copy()
    df['symbol'] = symbol # Add symbol column needed by some logic/output structure

    # Ensure timestamp is datetime and set as index temporarily
    if 'timestamp' not in df.columns:
        print("Error: 'timestamp' column not found.")
        return pd.DataFrame()
    try:
        df['timestamp'] = pd.to_datetime(df['timestamp'])
    except Exception as e:
        print(f"Error converting timestamp column to datetime: {e}")
        return pd.DataFrame()

    df = df.sort_values('timestamp')
    if df['timestamp'].isnull().any():
         print("Warning: NaN values found in timestamp after conversion. Dropping rows.")
         df = df.dropna(subset=['timestamp'])

    df = df.set_index('timestamp', drop=False)

    # Rename volume columns if they exist (handle potential missing cols)
    if 'volumefrom' in df.columns:
         df['volume_btc'] = df['volumefrom']
    elif 'Volume BTC' in df.columns:
         df['volume_btc'] = df['Volume BTC']
    else:
         # print("Warning: 'volumefrom' or 'Volume BTC' column not found. Volume features may fail.")
         df['volume_btc'] = 0 # Add dummy column

    if 'volumeto' in df.columns:
         df['volume_usd'] = df['volumeto']
    elif 'Volume USD' in df.columns:
         df['volume_usd'] = df['Volume USD']
    else:
         # print("Warning: 'volumeto' or 'Volume USD' column not found.")
         df['volume_usd'] = 0 # Add dummy column

    # Basic Checks
    required_cols = ['open', 'high', 'low', 'close', 'volume_btc']
    for col in required_cols:
        if col not in df.columns:
            print(f"Error: Required column '{col}' not found in DataFrame.")
            return pd.DataFrame()
        # Ensure numeric types
        df[col] = pd.to_numeric(df[col], errors='coerce')

    if df[required_cols].isnull().any().any():
        print(f"Warning: NaNs found in required OHLCV columns. Rows before drop: {len(df)}")
        df = df.dropna(subset=required_cols)
        print(f"Rows after dropping NaNs in OHLCV: {len(df)}")
        if df.empty: return pd.DataFrame()

    # --- Time-Based Features ---
    hour_of_day = df.index.hour
    day_of_week = df.index.dayofweek
    for hour in range(24): df[f'hour_{hour}'] = (hour_of_day == hour).astype(int)
    for day in range(7): df[f'day_{day}'] = (day_of_week == day).astype(int)

    # --- Basic Price & Volume Changes ---
    df['price_return_1h'] = df['close'].pct_change()
    df['oc_change_pct'] = (df['close'] - df['open']) / df['open']
    df['price_range_pct'] = (df['high'] - df['low']) / df['low']
    df['volume_return_1h'] = df['volume_btc'].pct_change()

    # --- Moving Averages and Rolling Std ---
    min_periods_base = 2
    for hours in [3, 6, 12, 24, 48, 72, 168]:
        if len(df) >= hours:
            df[f'ma_{hours}h'] = df['close'].rolling(window=hours, min_periods=min_periods_base).mean()
            df[f'rolling_std_{hours}h'] = df['close'].rolling(window=hours, min_periods=min_periods_base).std()
        else: df[f'ma_{hours}h'], df[f'rolling_std_{hours}h'] = np.nan, np.nan

    # --- Lagged Price & Volume Returns ---
    for hours in [3, 6, 12, 24, 48, 72, 168]: df[f'lag_{hours}h_price_return'] = df['close'].pct_change(periods=hours)
    for hours in [3, 6, 12, 24]: df[f'lag_{hours}h_volume_return'] = df['volume_btc'].pct_change(periods=hours)

    # --- Volatility Measures ---
    log_hl = np.log(df['high'].astype(float) / df['low'].astype(float))**2
    log_co = np.log(df['close'].astype(float) / df['open'].astype(float))**2
    df['garman_klass_12h'] = np.sqrt(0.5 * log_hl.rolling(12, min_periods=min(12, len(df))).sum() - (2*np.log(2)-1) * log_co.rolling(12, min_periods=min(12, len(df))).sum()) if len(df) >= 12 else np.nan
    df['parkinson_3h'] = np.sqrt((1 / (4 * np.log(2))) * (np.log(df['high'].astype(float) / df['low'].astype(float))**2).rolling(3, min_periods=min(3, len(df))).sum()) if len(df) >= 3 else np.nan
    df['tr'] = np.maximum(df['high'] - df['low'], np.maximum(abs(df['high'] - df['close'].shift(1)), abs(df['low'] - df['close'].shift(1))))
    for hours in [14, 24, 48]:
        if len(df) >= hours: df[f'atr_{hours}h'] = df['tr'].rolling(hours, min_periods=hours).mean()
        else: df[f'atr_{hours}h'] = np.nan

    # --- Momentum Indicators (RSI, MACD) ---
    if len(df) >= 14:
        delta = df['close'].diff()
        gain = (delta.where(delta > 0, 0)).rolling(window=14, min_periods=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14, min_periods=14).mean()
        with np.errstate(divide='ignore', invalid='ignore'): rs = gain / loss
        df['rsi_14h'] = (100 - (100 / (1 + rs))).replace([np.inf], 100).fillna(50)
    else: df['rsi_14h'] = 50.0
    min_ema_period = 26
    if len(df) >= min_ema_period:
        ema_12 = df['close'].ewm(span=12, adjust=False, min_periods=12).mean()
        ema_26 = df['close'].ewm(span=26, adjust=False, min_periods=26).mean()
        df['macd'] = ema_12 - ema_26
        df['macd_signal'] = df['macd'].ewm(span=9, adjust=False, min_periods=9).mean()
        df['macd_hist'] = df['macd'] - df['macd_signal']
    else: df['macd'], df['macd_signal'], df['macd_hist'] = np.nan, np.nan, np.nan

    # --- Volume-Based Indicators ---
    for hours in [12, 24, 72, 168]:
        if len(df) >= hours: df[f'volume_ma_{hours}h'] = df['volume_btc'].rolling(window=hours, min_periods=min_periods_base).mean()
        else: df[f'volume_ma_{hours}h'] = np.nan
    with np.errstate(divide='ignore', invalid='ignore'): df['volume_div_ma_24h'] = df['volume_btc'] / df['volume_ma_24h']
    obv = (np.sign(df['close'].diff()) * df['volume_btc']).fillna(0).cumsum(); df['obv'] = obv
    if len(df) >= 24: df['obv_ma_24h'] = df['obv'].rolling(window=24, min_periods=min_periods_base).mean()
    else: df['obv_ma_24h'] = np.nan

    # --- Ratio Features ---
    with np.errstate(divide='ignore', invalid='ignore'):
        df['close_div_ma_24h'] = df['close'] / df['ma_24h']
        df['close_div_ma_48h'] = df['close'] / df['ma_48h']
        df['close_div_ma_168h'] = df['close'] / df['ma_168h']
        df['ma12_div_ma48'] = df['ma_12h'] / df['ma_48h']
        df['ma24_div_ma168'] = df['ma_24h'] / df['ma_168h']
        df['std12_div_std72'] = df['rolling_std_12h'] / df['rolling_std_72h']
        df['atr_14_div_atr_48'] = df['atr_14h'] / df['atr_48h']

    # --- Normalized Price Position (Stochastic) ---
    if len(df) >= 14:
        low_14 = df['low'].rolling(window=14, min_periods=14).min()
        high_14 = df['high'].rolling(window=14, min_periods=14).max()
        range_14 = high_14 - low_14
        with np.errstate(divide='ignore', invalid='ignore'):
            df['stoch_k'] = (100 * (df['close'] - low_14) / range_14).replace(np.inf, 50).fillna(50)
        if len(df) >= 16: df['stoch_d'] = df['stoch_k'].rolling(window=3, min_periods=3).mean()
        else: df['stoch_d'] = np.nan
    else: df['stoch_k'], df['stoch_d'] = 50.0, np.nan
    with np.errstate(divide='ignore', invalid='ignore'): df['z_score_24h'] = (df['close'] - df['ma_24h']) / df['rolling_std_24h']

    # --- Higher-Order Statistics ---
    if len(df) >= 24:
        df['rolling_skew_24h'] = df['price_return_1h'].rolling(window=24, min_periods=24).skew()
        df['rolling_kurt_24h'] = df['price_return_1h'].rolling(window=24, min_periods=24).kurt()
    else: df['rolling_skew_24h'], df['rolling_kurt_24h'] = np.nan, np.nan

    # --- Interaction & Non-linear ---
    df['volume_btc_x_range'] = df['volume_btc'] * df['price_range_pct']
    df['rolling_std_3h_sq'] = df['rolling_std_3h'] ** 2
    df['price_return_1h_sq'] = df['price_return_1h'] ** 2
    df['rolling_std_12h_sqrt'] = np.sqrt(df['rolling_std_12h'].abs())

    # --- pandas_ta Features ---
    # Use a temporary df for ta functions that might expect specific column names or modify the df inplace
    ta_df = df.rename(columns={'volume_btc': 'volume'}, errors='ignore') # Rename for ta library if needed

    # Check for required columns before calling ta functions
    if all(c in ta_df.columns for c in ['high', 'low', 'close']):
        # ADX/DMI
        try: # Add try-except for robustness
            adx_df = ta_df.ta.adx(length=14) # Check pandas_ta documentation for exact column names
            if adx_df is not None and not adx_df.empty:
                 # Adjust column names based on your pandas_ta version output
                 df['adx_14h'] = adx_df.get('ADX_14', np.nan)
                 df['plus_di_14h'] = adx_df.get('DMP_14', np.nan)
                 df['minus_di_14h'] = adx_df.get('DMN_14', np.nan)
            else: df['adx_14h'], df['plus_di_14h'], df['minus_di_14h'] = np.nan, np.nan, np.nan
        except Exception as e:
            print(f"Warning: Error calculating ADX: {e}")
            df['adx_14h'], df['plus_di_14h'], df['minus_di_14h'] = np.nan, np.nan, np.nan

        # Accumulation/Distribution Line (requires volume)
        if 'volume' in ta_df.columns:
             try: df['ad_line'] = ta_df.ta.ad()
             except Exception as e: print(f"Warning: Error calculating AD Line: {e}"); df['ad_line'] = np.nan
        else: df['ad_line'] = np.nan

        # Chaikin Money Flow (requires volume)
        if 'volume' in ta_df.columns:
             try: df['cmf_20h'] = ta_df.ta.cmf(length=20)
             except Exception as e: print(f"Warning: Error calculating CMF: {e}"); df['cmf_20h'] = np.nan
        else: df['cmf_20h'] = np.nan

        # Commodity Channel Index
        try: df['cci_20h'] = ta_df.ta.cci(length=20)
        except Exception as e: print(f"Warning: Error calculating CCI: {e}"); df['cci_20h'] = np.nan

        # Bollinger Bands
        try:
            bbands_df = ta_df.ta.bbands(length=20, std=2)
            if bbands_df is not None and not bbands_df.empty:
                # Adjust column names based on your pandas_ta version output (e.g., BBB_20_2.0, BBP_20_2.0)
                df['bband_width_20h'] = bbands_df.get('BBB_20_2.0', np.nan) # Bandwidth
                df['bband_pctb_20h'] = bbands_df.get('BBP_20_2.0', np.nan) # %B
            else: df['bband_width_20h'], df['bband_pctb_20h'] = np.nan, np.nan
        except Exception as e:
            print(f"Warning: Error calculating Bollinger Bands: {e}")
            df['bband_width_20h'], df['bband_pctb_20h'] = np.nan, np.nan
    else:
        print("Warning: Missing high, low, or close columns. Skipping some pandas_ta features.")
        # Add NaN columns explicitly if base cols are missing
        df['adx_14h'], df['plus_di_14h'], df['minus_di_14h'] = np.nan, np.nan, np.nan
        df['ad_line'], df['cmf_20h'], df['cci_20h'] = np.nan, np.nan, np.nan
        df['bband_width_20h'], df['bband_pctb_20h'] = np.nan, np.nan


    # --- Position in Range ---
    range_hl = df['high'] - df['low']
    with np.errstate(divide='ignore', invalid='ignore'):
        df['close_pos_in_range'] = ((df['close'] - df['low']) / range_hl).fillna(0.5).replace([np.inf, -np.inf], 0.5)

    # --- Additional Features ---
    df['log_return_1h'] = np.log(df['close'] / df['close'].shift(1))
    # VWAP (24h)
    if 'volume_btc' in df.columns and len(df) >= 24:
        numerator = (df['close'] * df['volume_btc']).rolling(window=24, min_periods=24).sum()
        denominator = df['volume_btc'].rolling(window=24, min_periods=24).sum()
        with np.errstate(divide='ignore', invalid='ignore'): df['vwap_24h'] = numerator / denominator
        df['vwap_24h'] = df['vwap_24h'].ffill() # Optional: forward fill VWAP NaNs (fillna(method='ffill') is deprecated in recent pandas)
    else: df['vwap_24h'] = np.nan
    # Stochastic RSI using pandas_ta
    if 'close' in ta_df.columns:
        try:
            stochrsi_df = ta_df.ta.stochrsi(length=14)
            if stochrsi_df is not None and not stochrsi_df.empty:
                # Adjust column names based on pandas_ta version output
                k_col = next((col for col in stochrsi_df.columns if 'STOCHRSIk' in col), None)
                d_col = next((col for col in stochrsi_df.columns if 'STOCHRSId' in col), None)
                df['stoch_rsi_k'] = stochrsi_df[k_col] if k_col else np.nan
                df['stoch_rsi_d'] = stochrsi_df[d_col] if d_col else np.nan
            else: df['stoch_rsi_k'], df['stoch_rsi_d'] = np.nan, np.nan
        except Exception as e:
            print(f"Warning: Error calculating StochRSI: {e}")
            df['stoch_rsi_k'], df['stoch_rsi_d'] = np.nan, np.nan
    else: df['stoch_rsi_k'], df['stoch_rsi_d'] = np.nan, np.nan


    # --- Final Cleanup ---
    df = df.reset_index(drop=True)
    df = df.replace([np.inf, -np.inf], np.nan) # Final check for inf

    # Remove intermediate features if desired (e.g., 'tr' was used for ATR)
    df = df.drop(columns=['tr'], errors='ignore')

    print(f"Feature calculation finished. Returning {len(df)} rows with {len(df.columns)} columns.")
    return df


# --- Helper Function for VIF Calculation ---
def calculate_vif(df_features, vif_threshold=10.0, verbose=True):
    """
    Calculates VIF for a DataFrame of features and iteratively removes
    features with VIF above the threshold.

    Args:
        df_features (pd.DataFrame): DataFrame containing ONLY the numeric features to check.
        vif_threshold (float): Features with VIF at or above this value are removed, highest first.
        verbose (bool): If True, print messages about dropped features.

    Returns:
        list: A list of feature names (strings) that remain after VIF filtering.
    """
    if df_features.empty:
        print("VIF calculation skipped: Input DataFrame is empty.")
        return []

    if verbose: print(f"\n--- Feature Selection: Calculating VIF (Threshold: {vif_threshold}) ---")
    features = df_features.copy()
    original_feature_count = features.shape[1]
    if verbose: print(f"Features before VIF check: {original_feature_count}")

    # VIF cannot handle NaN or Inf values. Impute temporarily for calculation.
    num_nans_before = features.isnull().sum().sum()
    if num_nans_before > 0:
        if verbose: print(f"  Warning: Found {num_nans_before} NaNs. Imputing with column medians for VIF calculation ONLY.")
        medians = features.median() # Calculate medians once
        features = features.fillna(medians)
        # Verify imputation worked
        num_nans_after = features.isnull().sum().sum()
        if num_nans_after > 0:
             print(f"  ERROR: {num_nans_after} NaNs remain after median imputation. Dropping columns with all NaNs before VIF.")
             cols_all_nan = features.columns[features.isnull().all()].tolist()
             print(f"  Dropping: {cols_all_nan}")
             features = features.dropna(axis=1, how='all')
             if features.empty:
                  print("  ERROR: DataFrame empty after dropping all-NaN columns. VIF calculation aborted.")
                  return [] # Return empty list

    # Check for infinities (should have been handled earlier, but robust check)
    num_infs = np.isinf(features.values).sum()
    if num_infs > 0:
        print(f"  ERROR: Found {num_infs} infinite values even after initial cleaning. Cannot calculate VIF. Aborting.")
        # Return original column names from input df_features as fallback
        return df_features.columns.tolist()

    # Check for constant columns after imputation (VIF calculation will fail)
    constant_cols = features.columns[features.nunique() <= 1].tolist()
    if constant_cols:
        if verbose: print(f"  Warning: Removing constant columns found after imputation before VIF: {constant_cols}")
        features = features.drop(columns=constant_cols)
        if features.empty:
            print("  ERROR: DataFrame empty after dropping constant columns post-imputation. VIF calculation aborted.")
            return [] # Return empty list

    # Initial list of features to check
    kept_features = features.columns.tolist()
    dropped_features_vif = [] # Keep track of dropped features

    # Iterative VIF calculation
    while len(kept_features) > 1: # Need at least 2 features for VIF
        vif_data = pd.DataFrame()
        vif_data["feature"] = kept_features
        try:
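            # Calculate VIF on the current feature subset (features[kept_features]).
            # Note: variance_inflation_factor does not add an intercept itself; a constant column
            # (e.g. from statsmodels.api.add_constant) is commonly included so the auxiliary
            # regressions are centered.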
            vif_values = [variance_inflation_factor(features[kept_features].values, i)
                          for i in range(len(kept_features))]
            vif_data["VIF"] = vif_values
        except PerfectSeparationError:
            # Defensive only: statsmodels raises this from discrete-choice models, not from
            # the OLS fit behind variance_inflation_factor. Perfect linear dependency
            # normally surfaces as an infinite VIF, which is handled after this block.
            print("  Error: Perfect separation detected during VIF calculation (implies perfect linear dependency).")
            # For simplicity, we stop here. Manual inspection might be needed.
            print("  Stopping VIF calculation. Manual inspection of remaining features recommended.")
            break
        except ValueError as ve:
            # Catch errors like "contains NaNs" if imputation failed somehow
             print(f"  ValueError during VIF calculation: {ve}. Check data for unexpected NaNs/Infs.")
             break
        except Exception as e:
            # Catch other potential errors
            print(f"  Unexpected error calculating VIF: {e}")
            print("  Stopping VIF calculation.")
            break

        max_vif = vif_data['VIF'].max()

        # Check for NaN/Inf in VIF results (can happen with near-perfect multicollinearity)
        is_nan_inf = vif_data['VIF'].isna() | np.isinf(vif_data['VIF'])
        if is_nan_inf.any():
             nan_inf_vif_features = vif_data.loc[is_nan_inf, 'feature'].tolist()
             if verbose: print(f"  Warning: Found NaN/Inf VIF for features: {nan_inf_vif_features}. Indicates extreme multicollinearity.")
             # Drop the first feature found with NaN/Inf VIF in this iteration
             feature_to_drop_inf = nan_inf_vif_features[0]
             if verbose: print(f"  Dropping feature '{feature_to_drop_inf}' due to NaN/Inf VIF.")
             kept_features.remove(feature_to_drop_inf)
             dropped_features_vif.append(feature_to_drop_inf)
             continue # Recalculate VIF with the reduced set

        # Check if the highest VIF is below the threshold
        if max_vif < vif_threshold:
            if verbose: print(f"  Max VIF ({max_vif:.2f}) is below threshold {vif_threshold}. Stopping.")
            break
        else:
            # Find and drop the feature with the highest VIF
            feature_to_drop = vif_data.sort_values('VIF', ascending=False)['feature'].iloc[0]
            if verbose: print(f"  Dropping '{feature_to_drop}' (VIF: {max_vif:.2f})")
            kept_features.remove(feature_to_drop)
            dropped_features_vif.append(feature_to_drop)

    # End of VIF loop
    if verbose:
        print(f"Removed {len(dropped_features_vif)} features based on VIF.")
        if dropped_features_vif:
            # Sort for consistent output
            print(f"  Features Dropped by VIF: {sorted(list(dropped_features_vif))}")
        print(f"Features remaining after VIF check: {len(kept_features)}")
        print(f"--- VIF calculation finished. ---")

    # Return the final list of features that passed the VIF check
    # These names correspond to columns in the original df_features input
    final_kept_feature_names = [col for col in df_features.columns if col in kept_features]
    return final_kept_feature_names


# --- Main Execution Block ---
if __name__ == "__main__":

    print("--- 1. Data Loading ---")
    try:
        print(f"Loading data from: {CSV_FILE_PATH}")
        # Adjust column names based on your actual CSV header
        col_names = ['unix', 'date', 'symbol_csv', 'open', 'high', 'low', 'close', 'Volume BTC', 'Volume USD']
        df_raw = pd.read_csv(CSV_FILE_PATH, header=0, names=col_names) # Assumes header is present, uses provided names
        print(f"Raw data loaded. Shape: {df_raw.shape}")

        # Basic preprocessing: Use 'date' column, drop others
        df_raw['timestamp'] = pd.to_datetime(df_raw['date'])
        df_raw = df_raw.drop(['unix', 'date', 'symbol_csv'], axis=1) # Drop original time/symbol cols
        df_raw = df_raw.sort_values('timestamp').reset_index(drop=True)

        if df_raw.empty:
            print("DataFrame is empty after loading/initial processing. Exiting.")
            exit()
        print(f"Data prepared for feature engineering. Shape: {df_raw.shape}")
        print(f"Columns: {df_raw.columns.tolist()}")

    except FileNotFoundError:
        print(f"Error: CSV file not found at {CSV_FILE_PATH}")
        exit()
    except Exception as e:
        print(f"Error loading or processing CSV: {e}")
        traceback.print_exc()
        exit()

    print("\n--- 2. Feature Engineering ---")
    feature_calc_start = time.time()
    df_full_features = calculate_advanced_features(df_raw, symbol=SYMBOL_NAME)
    feature_calc_end = time.time()

    if df_full_features.empty:
        print("Feature calculation failed or resulted in empty DataFrame. Exiting.")
        exit()
    print(f"Feature calculation completed in {feature_calc_end - feature_calc_start:.2f} seconds.")
    print(f"Total columns after feature engineering: {len(df_full_features.columns)}")

    # Identify potential features by excluding NON_FEATURE_COLS
    all_potential_features = [col for col in df_full_features.columns if col not in NON_FEATURE_COLS]
    print(f"Identified {len(all_potential_features)} potential features to analyze.")

    # Create DataFrame with only potential features for selection steps
    df_features_only = df_full_features[all_potential_features].copy()

    print("\n--- 3. Cleaning Infinities (in Feature Columns) ---")
    infinite_before = np.isinf(df_features_only.values).sum()
    df_features_only = df_features_only.replace([np.inf, -np.inf], np.nan)
    print(f"Replaced {infinite_before - np.isinf(df_features_only.values).sum()} infinite values with NaN.")

    # --- 4. Feature Selection: Variance Threshold ---
    print(f"\n--- 4. Feature Selection: Variance Threshold (Threshold: {VARIANCE_THRESHOLD}) ---")
    features_before_variance = df_features_only.columns.tolist()
    print(f"Features before variance check: {len(features_before_variance)}")

    # VarianceThreshold needs non-NaN data to fit. Use median imputation temporarily FOR FITTING ONLY.
    # Create a temporary imputed version for fitting the selector
    df_temp_imputed_var = df_features_only.copy()
    num_nans_variance = df_temp_imputed_var.isnull().sum().sum()
    if num_nans_variance > 0:
        print(f"  Imputing {num_nans_variance} NaNs with medians temporarily for VarianceThreshold fitting.")
        df_temp_imputed_var = df_temp_imputed_var.fillna(df_temp_imputed_var.median())

    try:
        selector = VarianceThreshold(threshold=VARIANCE_THRESHOLD)
        # Fit on the imputed data
        selector.fit(df_temp_imputed_var)
        # Get the mask and apply to the *original* feature list (before imputation)
        features_after_variance = df_features_only.columns[selector.get_support()].tolist()
        dropped_variance = set(features_before_variance) - set(features_after_variance)
        print(f"Removed {len(dropped_variance)} features with variance <= {VARIANCE_THRESHOLD}.")
        if dropped_variance:
             # Sort for consistent output
             print(f"  Dropped by Variance Threshold: {sorted(list(dropped_variance))}")
    except Exception as e_var:
        print(f"Could not perform variance thresholding: {e_var}")
        print("Skipping variance filtering.")
        features_after_variance = features_before_variance # Keep all features if error

    print(f"Features remaining after variance check: {len(features_after_variance)}")

    # Update the DataFrame to contain only features that passed variance check
    df_features_selected = df_features_only[features_after_variance].copy()


    # --- 5. Feature Selection: VIF Threshold ---
    # Pass the DataFrame containing only features that passed the variance check
    features_after_vif = calculate_vif(
        df_features_selected, # Pass the dataframe with features selected so far
        vif_threshold=VIF_THRESHOLD,
        verbose=True
    )


    # --- 6. Final Selected Features ---
    print(f"\n--- 6. Final Feature Selection Summary ---")
    print(f"Total potential features identified:      {len(all_potential_features)}")
    print(f"Features after Variance Threshold ({VARIANCE_THRESHOLD}): {len(features_after_variance)}")
    print(f"Features after VIF Threshold ({VIF_THRESHOLD}):      {len(features_after_vif)}")
    print("\nFinal list of selected feature names:")
    # Print the list nicely
    if features_after_vif:
        for i, feature in enumerate(sorted(features_after_vif)): # Sort alphabetically for readability
            print(f"  {i+1}. {feature}")
    else:
        print("  No features remained after filtering.")

    print(f"\nTotal selected features: {len(features_after_vif)}")
    print("\nFeature selection script finished.")

--- 1. Data Loading ---
Loading data from: C:\Users\mason\AVP\BTCUSD.csv
Raw data loaded. Shape: (60403, 9)
Data prepared for feature engineering. Shape: (60403, 7)
Columns: ['open', 'high', 'low', 'close', 'Volume BTC', 'Volume USD', 'timestamp']

--- 2. Feature Engineering ---
Starting feature calculation...
Feature calculation finished. Returning 60403 rows with 115 columns.
Feature calculation completed in 0.46 seconds.
Total columns after feature engineering: 115
Identified 107 potential features to analyze.

--- 3. Cleaning Infinities (in Feature Columns) ---
Replaced 0 infinite values with NaN.

--- 4. Feature Selection: Variance Threshold (Threshold: 0.0003) ---
Features before variance check: 107
  Imputing 963 NaNs with medians temporarily for VarianceThreshold fitting.
Removed 9 features with variance <= 0.0003.
  Dropped by Variance Threshold: ['garman_klass_12h', 'lag_3h_price_return', 'lag_6h_price_return', 'log_return_1h', 'oc_change_pct', 'parkinson_3h', 'price_range_pc

Okay, that's a creative and logical extension to address the "Delayed Reaction" problem! Adding an event-based trigger (large price move) alongside the scheduled daily VIF recalculation is definitely possible and aims to make the feature selection more responsive during potentially regime-shifting moments.

Let's analyze this hybrid approach:

Concept:

Baseline: Recalculate VIF once a day using a long lookback window (e.g., past 24 hours or several days of minute data) to determine the standard active feature set.

Event Trigger: Monitor the price change over a short, recent period (e.g., last 5-15 minutes, or since the last model run). If the absolute percentage change exceeds a threshold (e.g., +/- 4%), immediately trigger another VIF recalculation using the standard long lookback window.

Feature Set Update: Whenever VIF is recalculated (either scheduled or event-triggered), update the "currently active" feature list used by the frequently running classification model.
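
To make the concept concrete, here is a minimal sketch of the trigger logic only (it does not run VIF itself). The class name `VifTrigger`, the 15-minute window, and the choice to reset the window after an event are all assumptions for illustration; timestamps are plain seconds.

```python
from collections import deque
from typing import Deque, Optional, Tuple

class VifTrigger:
    """Hypothetical helper: says when VIF selection should be re-run."""

    def __init__(self, interval_s: float = 86_400.0, window_s: float = 900.0,
                 move_threshold: float = 0.04):
        self.interval_s = interval_s          # scheduled cadence (daily)
        self.window_s = window_s              # short lookback for the price move
        self.move_threshold = move_threshold  # +/- 4% event trigger
        self.last_run_ts: Optional[float] = None
        self.recent: Deque[Tuple[float, float]] = deque()  # (ts, price)

    def update(self, ts: float, price: float) -> Optional[str]:
        """Feed one tick; return 'initial', 'scheduled', 'event', or None."""
        self.recent.append((ts, price))
        # Drop ticks that are older than the short monitoring window.
        while ts - self.recent[0][0] > self.window_s:
            self.recent.popleft()
        move = abs(price / self.recent[0][1] - 1.0)

        if self.last_run_ts is None:
            reason = "initial"
        elif ts - self.last_run_ts >= self.interval_s:
            reason = "scheduled"
        elif move >= self.move_threshold:
            reason = "event"
        else:
            return None

        self.last_run_ts = ts
        if reason == "event":
            # Assumption: restart the window so the same move does not re-trigger.
            self.recent.clear()
            self.recent.append((ts, price))
        return reason

trigger = VifTrigger()
ticks = [(0, 100.0), (60, 100.5), (120, 95.0), (180, 95.1)]
reasons = [trigger.update(ts, px) for ts, px in ticks]
print(reasons)
```

Whenever `update` returns a reason, the caller would recompute VIF over the standard long window and swap in the new active feature list.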

Pros:

Faster Adaptation (Potential): In theory, this allows the system to reassess feature multicollinearity much faster during extreme moves than waiting for the next daily cycle. If the flash crash/pump does significantly alter the correlation structure, the model might benefit from an updated feature set sooner.

Targeted Recalculation: Avoids constant VIF recalculation (like every minute) but adds computation only when potentially necessary (during high volatility).

Cons & Considerations:

Correlation Calculation Lag: This is still the biggest challenge. Correlation (and therefore VIF) is calculated over a window. When a +/- 4% move happens very quickly, the bulk of the data within your chosen VIF calculation window (e.g., the last 24 hours) still represents the previous market regime. The VIF calculation triggered immediately might not show a drastically different result until the new regime dominates a larger portion of that window. The change in VIF might lag the price event considerably.

VIF Window Choice: Which window do you use for the event-triggered VIF?

Using the standard long window (e.g., 24h): More stable VIF calculation, but slow to reflect the immediate change (as mentioned above).

Using a very short window around the event: VIF calculation becomes highly unstable and unreliable. Not recommended.

The most practical approach is likely sticking to the standard long window even for event triggers, accepting the inherent lag.

Is Price Change the Right Trigger? Does a 4% move reliably indicate a change in correlation structure? Sometimes yes, sometimes no. It definitely indicates high volatility, but features might maintain their relationships. You might trigger recalculations that don't lead to significant changes in the selected feature set.

Whipsaws and Noise: In choppy markets, you might get multiple +/- 4% moves triggering frequent VIF recalculations, increasing computational load without necessarily providing a consistently better feature set if the underlying structure isn't fundamentally shifting back and forth that quickly.

Increased Complexity: Managing both scheduled and event-based triggers adds complexity to your system architecture and testing.

Downstream Model Impact: Frequent changes to the feature set (even if only hourly plus occasional event triggers) might require the downstream classification model to be retrained more often, or require careful handling so that it can cope with a varying set of input features.
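
The correlation lag point can be illustrated with synthetic data. The sketch below assumes two made-up series whose relationship flips from independent to tightly coupled at a known bar, and a 240-bar window. For exactly two predictors, VIF equals 1 / (1 - r^2), so the rolling correlation maps directly onto a rolling VIF.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, change_at, window = 2000, 1000, 240  # illustrative sizes, not from the notebook

x = rng.standard_normal(n)
noise = rng.standard_normal(n)
# Before the change y is independent of x; afterwards y tracks x closely.
y = np.where(np.arange(n) < change_at, noise, x + 0.1 * noise)

roll_corr = pd.Series(x).rolling(window).corr(pd.Series(y))
# With only two predictors, VIF = 1 / (1 - r^2)
pairwise_vif = 1.0 / (1.0 - roll_corr ** 2)

corr_before = roll_corr.iloc[change_at - 1]             # window entirely in the old regime
corr_10_after = roll_corr.iloc[change_at + 10]          # 11 of 240 bars in the new regime
corr_full_window_after = roll_corr.iloc[change_at + window]  # window entirely in the new regime

print(f"just before change:     r={corr_before:.2f}")
print(f"10 bars after change:   r={corr_10_after:.2f}")
print(f"one full window later:  r={corr_full_window_after:.2f}")
```

Ten bars after the switch the window is still dominated by old-regime data, so a VIF pass triggered at that moment would see almost the same structure as before the event.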

Alternative/Refinement:

Trigger on Volatility Increase: Instead of price change, trigger the VIF recalculation if a volatility measure (e.g., rolling ATR over the VIF window, or short-term vs long-term volatility ratio) crosses a significant threshold. This might be slightly more indicative of a potential regime change than price alone, but still suffers from the lag issue.

More Frequent Scheduled Updates: Simply increasing the scheduled VIF update frequency (e.g., from daily to every 4 hours or hourly) might be a simpler and sufficiently effective way to reduce the delay without the complexities of event triggers.
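
A volatility-based trigger could look roughly like the following. It uses the short-term vs long-term volatility ratio mentioned above, measured as the standard deviation of simple returns. The function name, the window lengths (12 and 168 bars), and the ratio threshold of 2.0 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def volatility_ratio_triggered(close: pd.Series, short_window: int = 12,
                               long_window: int = 168,
                               ratio_threshold: float = 2.0) -> bool:
    """Return True if short-term return volatility exceeds `ratio_threshold`
    times the long-term return volatility (hypothetical helper)."""
    returns = close.pct_change()
    short_vol = returns.rolling(short_window).std().iloc[-1]
    long_vol = returns.rolling(long_window).std().iloc[-1]
    if pd.isna(short_vol) or pd.isna(long_vol) or long_vol == 0:
        return False  # not enough history, or a flat series
    return bool(short_vol / long_vol > ratio_threshold)

# Deterministic synthetic returns: small alternating moves, then a burst of large ones
calm_returns = np.tile([0.001, -0.001], 150)
shock_returns = calm_returns.copy()
shock_returns[-12:] = np.tile([0.01, -0.01], 6)

calm_close = pd.Series(100.0 * np.cumprod(1.0 + calm_returns))
shock_close = pd.Series(100.0 * np.cumprod(1.0 + shock_returns))

calm_vol_trigger = volatility_ratio_triggered(calm_close)
shock_vol_trigger = volatility_ratio_triggered(shock_close)
print(f"calm: {calm_vol_trigger}, shock: {shock_vol_trigger}")
```

Like the price trigger, this only decides when to recalculate; the VIF calculation it launches still uses the long window and inherits the same lag.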

Conclusion:

Adding an event-based trigger for VIF recalculation based on large price moves is an interesting idea and technically feasible. It could potentially reduce the reaction delay during major market shifts compared to only daily updates.

However, be aware of the significant challenge posed by the lag inherent in correlation/VIF calculations over a window. The immediate trigger might not yield an immediately different feature set. You also introduce complexity and potential noise triggers.

Recommendation:

Start Simpler: Implement the periodic VIF update first (e.g., daily or every few hours).

Monitor: Observe how stable the selected feature set is over time. Does it change significantly between updates?

Consider the Trade-off: If you observe that significant market events often seem to precede changes in the optimal feature set and the delay is hurting performance, then consider experimenting with event-based triggers.

Test Thoroughly: If you implement event triggers, rigorously test the impact on performance, computational load, and overall system stability, paying close attention to the actual change (or lack thereof) in the selected features immediately following events.

It might turn out that more frequent scheduled updates (e.g., hourly) provide a better balance of adaptivity, stability, and complexity than adding event-based triggers.
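
For the "Monitor" step, one simple way to quantify how much the selected feature set changes between updates is the Jaccard similarity of consecutive selections. The helper names and the two example lists below are hypothetical; the feature names are borrowed from this notebook purely as placeholders.

```python
def jaccard_similarity(previous, current):
    """Share of features common to both selections (1.0 = identical sets)."""
    prev_set, curr_set = set(previous), set(current)
    if not prev_set and not curr_set:
        return 1.0
    return len(prev_set & curr_set) / len(prev_set | curr_set)

def summarize_feature_set_change(previous, current):
    """Report which features entered or left the selection between two VIF runs."""
    prev_set, curr_set = set(previous), set(current)
    return {
        "added": sorted(curr_set - prev_set),
        "removed": sorted(prev_set - curr_set),
        "jaccard": jaccard_similarity(previous, current),
    }

# Placeholder selections from two consecutive scheduled updates
previous_run = ["cmf_20h", "rolling_kurt_24h", "std12_div_std72", "volume_return_1h"]
current_run = ["cmf_20h", "rolling_kurt_24h", "std12_div_std72", "rolling_skew_24h"]

change = summarize_feature_set_change(previous_run, current_run)
print(change)
```

Logging this summary at every scheduled update gives a baseline. If the similarity stays high even across large market moves, that would suggest event triggers add little.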

Post-VIF Feature Transformations

In [2]:
import pandas as pd
import numpy as np
import time
import os
import warnings
from sklearn.feature_selection import VarianceThreshold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.sm_exceptions import PerfectSeparationError

# Assume the following variables are available from the previous steps:
# - df_full_features: DataFrame containing all original columns + engineered features
# - features_after_vif: List of the 58 feature names that survived the first VIF pass
# - calculate_vif: The helper function for VIF calculation (defined previously)
# - VIF_THRESHOLD: The VIF cutoff used in the first pass (6.9 in this run)

print("\n--- 7. Advanced Feature Engineering (Post-Initial VIF) ---")

# Make sure features_after_vif is not empty
if not features_after_vif:
    print("Skipping advanced feature engineering: No features survived initial VIF.")
    final_selected_features = []
else:
    print(f"Starting with {len(features_after_vif)} features selected by initial VIF.")

    # Create a DataFrame with only the selected features for engineering
    df_eng = df_full_features[features_after_vif].copy()
    engineered_feature_names = list(features_after_vif) # Keep track

    # --- A. Univariate Transformations ---
    print("\n  --- A. Applying Univariate Transformations ---")
    # Define candidates based on names/types (adjust based on your actual feature list)
    # Ensure candidates actually exist in df_eng.columns
    cols_for_log = [c for c in ['Volume BTC', 'Volume USD', 'volume_ma_12h', 'volume_ma_168h', 'rolling_std_6h', 'rolling_std_48h', 'rolling_std_168h', 'bband_width_20h', 'volume_btc_x_range'] if c in df_eng.columns]
    cols_for_sqrt = [c for c in ['rolling_std_6h', 'rolling_std_48h', 'rolling_std_168h', 'rolling_std_3h_sq', 'bband_width_20h'] if c in df_eng.columns] # Non-negative candidates
    cols_for_sq = [c for c in ['lag_12h_price_return', 'lag_24h_price_return', 'lag_48h_price_return', 'lag_72h_price_return', 'lag_168h_price_return', 'macd_hist', 'macd_signal', 'cci_20h', 'volume_return_1h', 'volume_div_ma_24h'] if c in df_eng.columns]

    print(f"  Applying log1p to {len(cols_for_log)} features...")
    for col in cols_for_log:
        new_col_name = f"{col}_log1p"
        # log1p handles zeros; NaNs are filled with 0 and negatives clipped to 0 first, so the log is always defined
        df_eng[new_col_name] = np.log1p(df_eng[col].fillna(0).clip(lower=0))
        engineered_feature_names.append(new_col_name)

    print(f"  Applying sqrt to {len(cols_for_sqrt)} features...")
    for col in cols_for_sqrt:
        new_col_name = f"{col}_sqrt"
        # np.abs guards against stray negatives (these columns should already be non-negative)
        df_eng[new_col_name] = np.sqrt(np.abs(df_eng[col].fillna(0)))
        engineered_feature_names.append(new_col_name)

    print(f"  Applying square to {len(cols_for_sq)} features...")
    for col in cols_for_sq:
        new_col_name = f"{col}_sq"
        df_eng[new_col_name] = df_eng[col].fillna(0) ** 2
        engineered_feature_names.append(new_col_name)

    # --- B. Bivariate Interactions (Multiplication) ---
    print("\n  --- B. Creating Interaction Terms (Multiplication) ---")
    # Define categories based on feature names (adjust to your list!)
    # Ensure these lists only contain features PRESENT in df_eng.columns
    momentum_cols = [c for c in ['lag_12h_price_return', 'lag_24h_price_return', 'lag_48h_price_return', 'lag_72h_price_return', 'lag_168h_price_return', 'macd_hist', 'macd_signal', 'cci_20h'] if c in df_eng.columns]
    volume_cols = [c for c in ['Volume BTC', 'lag_12h_volume_return', 'lag_24h_volume_return', 'lag_3h_volume_return', 'lag_6h_volume_return', 'volume_ma_12h', 'volume_ma_168h', 'volume_div_ma_24h', 'volume_return_1h', 'cmf_20h'] if c in df_eng.columns]
    volatility_cols = [c for c in ['rolling_std_6h', 'rolling_std_48h', 'rolling_std_168h', 'bband_width_20h', 'std12_div_std72', 'rolling_kurt_24h'] if c in df_eng.columns] # Include Kurtosis as proxy?

    interaction_pairs = []
    # Momentum x Volume
    for m_col in momentum_cols:
        for v_col in volume_cols:
            # Avoid self-interaction if somehow a col is in both lists
            if m_col != v_col:
                interaction_pairs.append((m_col, v_col))
    # Momentum x Volatility
    for m_col in momentum_cols:
        for vo_col in volatility_cols:
            if m_col != vo_col:
                interaction_pairs.append((m_col, vo_col))
    # Volume x Volatility
    for v_col in volume_cols:
        for vo_col in volatility_cols:
            if v_col != vo_col:
                interaction_pairs.append((v_col, vo_col))

    # Remove duplicate pairs if any category overlaps.
    # dict.fromkeys keeps insertion order, so the column order is reproducible across runs
    # (list(set(...)) is not, because string hashing is randomized per process).
    interaction_pairs = list(dict.fromkeys(interaction_pairs))
    print(f"  Creating {len(interaction_pairs)} interaction features...")

    interaction_count = 0
    for col1, col2 in interaction_pairs:
        # Defensive check: both source columns must be present in df_eng
        if col1 in df_eng.columns and col2 in df_eng.columns:
            new_col_name = f"{col1}_x_{col2}"
            # Fill NaNs with 0 before multiplication? Or let NaNs propagate? Propagate is often better.
            df_eng[new_col_name] = df_eng[col1] * df_eng[col2]
            engineered_feature_names.append(new_col_name)
            interaction_count += 1
        # else: # Debugging if needed
        #     print(f"Skipping interaction: {col1} or {col2} not found.")

    print(f"  Successfully created {interaction_count} interaction features.")

    # --- C. Bivariate Interactions (Division / Ratios) ---
    # Be very selective. Example: std_short / std_long (if not already present)
    print("\n  --- C. Creating Interaction Terms (Division - Selective) ---")
    ratio_pairs = []
    # Example: if 'rolling_std_6h' and 'rolling_std_48h' survived
    if 'rolling_std_6h' in df_eng.columns and 'rolling_std_48h' in df_eng.columns:
        ratio_pairs.append(('rolling_std_6h', 'rolling_std_48h'))
    # Add other theoretically sound pairs IF NEEDED

    print(f"  Creating {len(ratio_pairs)} ratio features...")
    epsilon = 1e-9 # Small number to avoid division by zero
    for num_col, den_col in ratio_pairs:
        if num_col in df_eng.columns and den_col in df_eng.columns:
            new_col_name = f"{num_col}_div_{den_col}"
            # Handle division by zero robustly
            denominator = df_eng[den_col].fillna(0) # Fill NaNs in denominator? Risky if 0 is meaningful.
            # Push the denominator away from zero by epsilon in the direction of its sign (zero counts as positive)
            safe_denominator = denominator + np.where(denominator >= 0, epsilon, -epsilon)
            df_eng[new_col_name] = df_eng[num_col].fillna(0) / safe_denominator
            # Alternatively, use np.where
            # df_eng[new_col_name] = np.where(
            #     np.abs(denominator) > epsilon,
            #     df_eng[num_col].fillna(0) / denominator,
            #     0 # Or np.nan, or median, etc.
            # )
            engineered_feature_names.append(new_col_name)

    # --- D. Clean Up Engineered Features ---
    print("\n  --- D. Cleaning Engineered Features ---")
    # Remove duplicates from the list (if any transformation/interaction created same name)
    engineered_feature_names = sorted(list(set(engineered_feature_names)))
    # Ensure all listed features are actually columns in df_eng
    engineered_feature_names = [f for f in engineered_feature_names if f in df_eng.columns]

    # Select only the columns intended for the final VIF check
    df_eng_final = df_eng[engineered_feature_names].copy()

    # Replace any new infinities created (e.g., by division) with NaN
    # Count infinities first; after the replace there are none left, so one count is enough
    inf_count = int(np.isinf(df_eng_final.to_numpy(dtype=float)).sum())
    df_eng_final = df_eng_final.replace([np.inf, -np.inf], np.nan)
    if inf_count > 0:
        print(f"  Replaced {inf_count} new infinite values with NaN.")

    print(f"Total features after advanced engineering: {len(df_eng_final.columns)}")

    # --- E. Run VIF Again on the Expanded Set ---
    print("\n--- 8. Running VIF on Expanded Feature Set ---")
    # Use the same VIF threshold as before, or potentially a slightly higher one if needed
    final_selected_features = calculate_vif(
        df_eng_final,
        vif_threshold=VIF_THRESHOLD, # Use the globally defined threshold
        verbose=True
    )

    # --- F. Final Output ---
    print(f"\n--- 9. Final Selection Summary (Post-Advanced Engineering) ---")
    print(f"Features after initial VIF:              {len(features_after_vif)}")
    print(f"Features after advanced eng+transform: {len(df_eng_final.columns)}")
    print(f"Features after final VIF ({VIF_THRESHOLD}):      {len(final_selected_features)}")
    print("\nFinal list of selected feature names (after second VIF pass):")
    if final_selected_features:
        for i, feature in enumerate(sorted(final_selected_features)): # Sort alphabetically
            print(f"  {i+1}. {feature}")
    else:
        print("  No features remained after the final VIF filtering.")

    print(f"\nTotal final selected features: {len(final_selected_features)}")

# Add a final message outside the else block
print("\nAdvanced feature engineering and selection script finished.")

# The variable 'final_selected_features' now holds the names of the features
# that survived both rounds of VIF filtering.
# You would use this list in your subsequent modeling steps.


--- 7. Advanced Feature Engineering (Post-Initial VIF) ---
Starting with 58 features selected by initial VIF.

  --- A. Applying Univariate Transformations ---
  Applying log1p to 9 features...
  Applying sqrt to 5 features...
  Applying square to 10 features...

  --- B. Creating Interaction Terms (Multiplication) ---
  Creating 188 interaction features...
  Successfully created 188 interaction features.

  --- C. Creating Interaction Terms (Division - Selective) ---
  Creating 1 ratio features...

  --- D. Cleaning Engineered Features ---
Total features after advanced engineering: 271

--- 8. Running VIF on Expanded Feature Set ---

--- Feature Selection: Calculating VIF (Threshold: 6.9) ---
Features before VIF check: 271
  Warning: Found 8602 NaNs. Imputing with column medians for VIF calculation ONLY.
  Dropping 'volume_return_1h_x_rolling_std_48h' (VIF: 404620.59)
  Dropping 'macd_signal_x_volume_return_1h' (VIF: 211199.78)
  Dropping 'lag_48h_price_return_x_volume_return_1h' (VIF: 102324.20)
  Dropping 'volume_return_1h_sq' (VIF: 78528.17)
  Dropping 'volume_return_1h_x_rolling_std_168h' (VIF: 59399.18)
  Dropping 'lag_72h_price_return_x_volume_return_1h' (VIF: 41367.88)
  Dropping 'volume_return_1h_x_bband_width_20h' (VIF: 32665.66)
  Dropping 'lag_168h_price_return_x_volume_return_1h' (VIF: 21942.96)
  Dropping 'lag_6h_volume_return_x_rolling_std_48h' (VIF: 16193.96)
  Dropping 'lag_12h_volume_return_x_rolling_std_48h' (VIF: 13149.26)
  Dropping 'lag_3h_volume_return_x_rolling_std_48h' (VIF: 11874.49)
  Dropping 'volume_return_1h_x_rolling_std_6h' (VIF: 9517.35)
  Dropping 'volume_return_1h_x_std12_div_std72' (VIF: 8361.00)
  Dropping 'lag_3h_volume_return_x_rolling_std_168h' (VIF: 4202.18)
  Dropping 'lag_6h_volume_return_x_rolling_std_168h' (VIF: 4140.54)
  Dropping 'lag_12h_volume_return_x_rolling_std_168h' (VIF: 3688.98)
  Dropping 'macd_signal_x_lag_3h_volume_return' (VIF: 3442.17)
  Dropping 'lag_12h_volume_return_x_bband_width_20h' (VIF: 3215.48)
  Dropping 'rolling_std_168h_log1p' (VIF: 3156.73)
  Dropping 'bband_width_20h_sqrt' (VIF: 15963.25)
  Dropping 'lag_48h_price_return_x_lag_3h_volume_return' (VIF: 2773.80)
  Dropping 'macd_hist_x_lag_12h_volume_return' (VIF: 2693.97)
  Dropping 'macd_hist_x_lag_6h_volume_return' (VIF: 2594.53)
  Dropping 'lag_3h_volume_return' (VIF: 2541.81)
  Dropping 'lag_12h_price_return_x_volume_return_1h' (VIF: 2127.71)
  Dropping 'lag_6h_volume_return_x_bband_width_20h' (VIF: 1844.85)
  Dropping 'macd_signal_x_lag_6h_volume_return' (VIF: 1700.96)
  Dropping 'rolling_std_48h_log1p' (VIF: 1497.23)
  Dropping 'lag_168h_price_return_x_lag_6h_volume_return' (VIF: 1459.85)
  Dropping 'lag_168h_price_return_x_lag_12h_volume_return' (VIF: 1324.18)
  Dropping 'macd_hist_x_volume_return_1h' (VIF: 1293.36)
  Dropping 'lag_3h_volume_return_x_bband_width_20h' (VIF: 954.81)
  Dropping 'rolling_std_6h_log1p' (VIF: 794.20)
  Dropping 'lag_168h_price_return_x_lag_3h_volume_return' (VIF: 755.22)
  Dropping 'Volume BTC_log1p' (VIF: 752.67)
  Dropping 'lag_12h_volume_return_x_rolling_std_6h' (VIF: 717.26)
  Dropping 'lag_6h_volume_return_x_rolling_std_6h' (VIF: 696.36)
  Dropping 'lag_12h_volume_return_x_std12_div_std72' (VIF: 508.86)
  Dropping 'macd_hist_x_lag_3h_volume_return' (VIF: 432.61)
  Dropping 'lag_12h_price_return_x_lag_12h_volume_return' (VIF: 406.68)
  Dropping 'rolling_std_48h_sqrt' (VIF: 379.80)
  Dropping 'lag_6h_volume_return' (VIF: 364.92)
  Dropping 'volume_ma_12h_log1p' (VIF: 361.34)
  Dropping 'lag_3h_volume_return_x_std12_div_std72' (VIF: 308.28)
  Dropping 'lag_24h_volume_return' (VIF: 298.66)
  Dropping 'lag_24h_volume_return_x_rolling_std_48h' (VIF: 264.86)
  Dropping 'macd_signal_x_lag_12h_volume_return' (VIF: 256.97)
  Dropping 'lag_48h_price_return_x_lag_6h_volume_return' (VIF: 233.08)
  Dropping 'rolling_std_168h_sqrt' (VIF: 202.52)
  Dropping 'cci_20h_x_volume_return_1h' (VIF: 156.38)
  Dropping 'lag_12h_price_return_x_lag_3h_volume_return' (VIF: 148.14)
  Dropping 'rolling_std_6h_sqrt' (VIF: 139.99)
  Dropping 'lag_48h_price_return_x_lag_12h_volume_return' (VIF: 112.16)
  Dropping 'bband_width_20h_log1p' (VIF: 106.75)
  Dropping 'volume_ma_168h_log1p' (VIF: 99.71)
  Dropping 'lag_24h_price_return_x_volume_ma_12h' (VIF: 97.83)
  Dropping 'lag_24h_volume_return_x_bband_width_20h' (VIF: 94.86)
  Dropping 'lag_48h_price_return_x_volume_ma_12h' (VIF: 88.00)
  Dropping 'cci_20h_x_lag_12h_volume_return' (VIF: 80.29)
  Dropping 'volume_ma_12h_x_rolling_std_48h' (VIF: 78.87)
  Dropping 'lag_12h_price_return_x_lag_24h_volume_return' (VIF: 71.70)
  Dropping 'lag_24h_price_return_x_lag_3h_volume_return' (VIF: 70.60)
  Dropping 'lag_12h_price_return_x_volume_ma_12h' (VIF: 63.35)
  Dropping 'lag_48h_price_return_x_bband_width_20h' (VIF: 62.72)
  Dropping 'lag_24h_volume_return_x_rolling_std_6h' (VIF: 59.18)
  Dropping 'lag_24h_price_return_x_volume_return_1h' (VIF: 56.08)
  Dropping 'lag_24h_price_return_x_Volume BTC' (VIF: 50.76)
  Dropping 'lag_12h_price_return' (VIF: 49.05)
  Dropping 'lag_48h_price_return_x_rolling_std_48h' (VIF: 46.67)
  Dropping 'volume_ma_168h' (VIF: 46.17)
  Dropping 'lag_24h_price_return_x_rolling_std_48h' (VIF: 44.88)
  Dropping 'Volume USD_log1p' (VIF: 43.65)
  Dropping 'lag_72h_price_return_x_volume_ma_12h' (VIF: 43.36)
  Dropping 'lag_12h_price_return_x_lag_6h_volume_return' (VIF: 42.43)
  Dropping 'lag_48h_price_return_x_Volume BTC' (VIF: 39.41)
  Dropping 'lag_24h_price_return' (VIF: 35.87)
  Dropping 'volume_ma_12h_x_rolling_std_168h' (VIF: 34.75)
  Dropping 'lag_24h_price_return_x_bband_width_20h' (VIF: 34.66)
  Dropping 'lag_12h_price_return_x_rolling_std_6h' (VIF: 34.53)
  Dropping 'lag_72h_price_return_x_rolling_std_48h' (VIF: 32.35)
  Dropping 'lag_48h_price_return' (VIF: 31.48)
  Dropping 'lag_168h_price_return_x_lag_24h_volume_return' (VIF: 31.24)
  Dropping 'Volume BTC_x_rolling_std_48h' (VIF: 30.57)
  Dropping 'lag_12h_price_return_x_rolling_std_48h' (VIF: 30.15)
  Dropping 'lag_12h_volume_return_x_rolling_kurt_24h' (VIF: 29.90)
  Dropping 'cci_20h_x_bband_width_20h' (VIF: 28.45)
  Dropping 'lag_12h_price_return_x_std12_div_std72' (VIF: 27.83)
  Dropping 'lag_48h_price_return_x_rolling_std_6h' (VIF: 27.40)
  Dropping 'lag_3h_volume_return_x_rolling_std_6h' (VIF: 27.38)
  Dropping 'lag_72h_price_return_x_bband_width_20h' (VIF: 25.71)
  Dropping 'macd_hist' (VIF: 25.41)
  Dropping 'lag_12h_price_return_x_Volume BTC' (VIF: 25.36)
  Dropping 'cci_20h_x_lag_6h_volume_return' (VIF: 24.64)
  Dropping 'volume_ma_12h' (VIF: 24.54)
  Dropping 'Volume BTC_x_rolling_std_6h' (VIF: 23.99)
  Dropping 'macd_signal_x_rolling_std_168h' (VIF: 23.50)
  Dropping 'macd_signal' (VIF: 21.08)
  Dropping 'lag_24h_price_return_x_std12_div_std72' (VIF: 20.84)
  Dropping 'volume_ma_168h_x_bband_width_20h' (VIF: 20.50)
  Dropping 'volume_div_ma_24h' (VIF: 20.38)
  Dropping 'rolling_std_48h' (VIF: 20.14)
  Dropping 'lag_12h_price_return_x_volume_div_ma_24h' (VIF: 19.72)
  Dropping 'macd_hist_x_lag_24h_volume_return' (VIF: 18.52)
  Dropping 'lag_168h_price_return' (VIF: 18.34)
  Dropping 'macd_hist_x_rolling_std_48h' (VIF: 18.13)
  Dropping 'lag_24h_price_return_x_rolling_std_6h' (VIF: 17.68)
  Dropping 'Volume BTC' (VIF: 17.33)
  Dropping 'lag_168h_price_return_x_volume_ma_12h' (VIF: 17.21)
  Dropping 'lag_72h_price_return_x_Volume BTC' (VIF: 17.06)
  Dropping 'volume_ma_12h_x_std12_div_std72' (VIF: 16.24)
  Dropping 'lag_48h_price_return_x_std12_div_std72' (VIF: 15.89)
  Dropping 'lag_72h_price_return_x_rolling_std_6h' (VIF: 15.72)
  Dropping 'lag_12h_price_return_x_bband_width_20h' (VIF: 14.60)
  Dropping 'bband_width_20h' (VIF: 14.19)
  Dropping 'Volume BTC_x_bband_width_20h' (VIF: 13.46)
  Dropping 'cci_20h_x_rolling_std_6h' (VIF: 13.26)
  Dropping 'cci_20h_x_volume_ma_12h' (VIF: 13.15)
  Dropping 'rolling_std_6h' (VIF: 12.42)
  Dropping 'lag_24h_volume_return_x_rolling_kurt_24h' (VIF: 12.10)
  Dropping 'macd_signal_x_volume_ma_168h' (VIF: 12.02)
  Dropping 'volume_ma_168h_x_rolling_std_6h' (VIF: 11.86)
  Dropping 'lag_24h_price_return_x_volume_div_ma_24h' (VIF: 11.34)
  Dropping 'macd_signal_x_bband_width_20h' (VIF: 11.17)
  Dropping 'cci_20h' (VIF: 10.87)
  Dropping 'lag_72h_price_return' (VIF: 10.58)
  Dropping 'volume_div_ma_24h_x_std12_div_std72' (VIF: 10.58)
  Dropping 'lag_48h_price_return_x_rolling_kurt_24h' (VIF: 10.34)
  Dropping 'lag_168h_price_return_x_rolling_std_48h' (VIF: 10.33)
  Dropping 'lag_72h_price_return_x_lag_24h_volume_return' (VIF: 9.90)
  Dropping 'macd_hist_x_volume_ma_168h' (VIF: 9.88)
  Dropping 'cmf_20h_x_rolling_std_48h' (VIF: 9.77)
  Dropping 'volume_div_ma_24h_x_rolling_std_48h' (VIF: 9.70)
  Dropping 'lag_72h_price_return_x_volume_div_ma_24h' (VIF: 9.32)
  Dropping 'macd_hist_x_bband_width_20h' (VIF: 9.00)
  Dropping 'rolling_std_3h_sq_sqrt' (VIF: 8.99)
  Dropping 'lag_24h_price_return_x_cmf_20h' (VIF: 8.46)
  Dropping 'lag_24h_price_return_x_lag_24h_volume_return' (VIF: 8.45)
  Dropping 'macd_hist_x_rolling_std_168h' (VIF: 8.41)
  Dropping 'volume_ma_168h_x_rolling_std_168h' (VIF: 8.38)
  Dropping 'volume_div_ma_24h_x_bband_width_20h' (VIF: 8.35)
  Dropping 'lag_6h_volume_return_x_std12_div_std72' (VIF: 8.20)
  Dropping 'Volume BTC_x_rolling_kurt_24h' (VIF: 8.03)
  Dropping 'lag_48h_price_return_x_cmf_20h' (VIF: 7.94)
  Dropping 'lag_72h_price_return_x_std12_div_std72' (VIF: 7.52)
  Dropping 'volume_div_ma_24h_x_rolling_std_6h' (VIF: 7.32)
  Dropping 'macd_hist_x_std12_div_std72' (VIF: 7.11)
  Dropping 'macd_signal_x_Volume BTC' (VIF: 6.98)
  Dropping 'lag_24h_price_return_x_rolling_kurt_24h' (VIF: 6.94)
  Max VIF (6.63) is below threshold 6.9. Stopping.
Removed 148 features based on VIF.
  Features Dropped by VIF: ['Volume BTC', 'Volume BTC_log1p', 'Volume BTC_x_bband_width_20h', 'Volume BTC_x_rolling_kurt_24h', 'Volume BTC_x_rolling_std_48h', 'Volume BTC_x_rolling_std_6h', 'Volume USD_log1p', 'bband_width_20h', 'bband_width_20h_log1p', 'bband_width_20h_sqrt', 'cci_20h', 'cci_20h_x_bband_width_20h', 'cci_20h_x_lag_12h_volume_return', 'cci_20h_x_lag_6h_volume_return', 'cci_20h_x_rolling_std_6h', 'cci_20h_x_volume_ma_12h', 'cci_20h_x_volume_return_1h', 'cmf_20h_x_rolling_std_48h', 'lag_12h_price_return', 'lag_12h_price_return_x_Volume BTC', 'lag_12h_price_return_x_bband_width_20h', 'lag_12h_price_return_x_lag_12h_volume_return', 'lag_12h_price_return_x_lag_24h_volume_return', 'lag_12h_price_return_x_lag_3h_volume_return', 'lag_12h_price_return_x_lag_6h_volume_return', 'lag_12h_price_return_x_rolling_std_48h', 'lag_12h_price_return_x_rolling_std_6h', 'lag_12h_price_return_x_std12_div_std72', 'lag_12h_price_return_x_volume_div_ma_24h', 'lag_12h_price_return_x_volume_ma_12h', 'lag_12h_price_return_x_volume_return_1h', 'lag_12h_volume_return_x_bband_width_20h', 'lag_12h_volume_return_x_rolling_kurt_24h', 'lag_12h_volume_return_x_rolling_std_168h', 'lag_12h_volume_return_x_rolling_std_48h', 'lag_12h_volume_return_x_rolling_std_6h', 'lag_12h_volume_return_x_std12_div_std72', 'lag_168h_price_return', 'lag_168h_price_return_x_lag_12h_volume_return', 'lag_168h_price_return_x_lag_24h_volume_return', 'lag_168h_price_return_x_lag_3h_volume_return', 'lag_168h_price_return_x_lag_6h_volume_return', 'lag_168h_price_return_x_rolling_std_48h', 'lag_168h_price_return_x_volume_ma_12h', 'lag_168h_price_return_x_volume_return_1h', 'lag_24h_price_return', 'lag_24h_price_return_x_Volume BTC', 'lag_24h_price_return_x_bband_width_20h', 'lag_24h_price_return_x_cmf_20h', 'lag_24h_price_return_x_lag_24h_volume_return', 'lag_24h_price_return_x_lag_3h_volume_return', 'lag_24h_price_return_x_rolling_kurt_24h', 'lag_24h_price_return_x_rolling_std_48h', 
'lag_24h_price_return_x_rolling_std_6h', 'lag_24h_price_return_x_std12_div_std72', 'lag_24h_price_return_x_volume_div_ma_24h', 'lag_24h_price_return_x_volume_ma_12h', 'lag_24h_price_return_x_volume_return_1h', 'lag_24h_volume_return', 'lag_24h_volume_return_x_bband_width_20h', 'lag_24h_volume_return_x_rolling_kurt_24h', 'lag_24h_volume_return_x_rolling_std_48h', 'lag_24h_volume_return_x_rolling_std_6h', 'lag_3h_volume_return', 'lag_3h_volume_return_x_bband_width_20h', 'lag_3h_volume_return_x_rolling_std_168h', 'lag_3h_volume_return_x_rolling_std_48h', 'lag_3h_volume_return_x_rolling_std_6h', 'lag_3h_volume_return_x_std12_div_std72', 'lag_48h_price_return', 'lag_48h_price_return_x_Volume BTC', 'lag_48h_price_return_x_bband_width_20h', 'lag_48h_price_return_x_cmf_20h', 'lag_48h_price_return_x_lag_12h_volume_return', 'lag_48h_price_return_x_lag_3h_volume_return', 'lag_48h_price_return_x_lag_6h_volume_return', 'lag_48h_price_return_x_rolling_kurt_24h', 'lag_48h_price_return_x_rolling_std_48h', 'lag_48h_price_return_x_rolling_std_6h', 'lag_48h_price_return_x_std12_div_std72', 'lag_48h_price_return_x_volume_ma_12h', 'lag_48h_price_return_x_volume_return_1h', 'lag_6h_volume_return', 'lag_6h_volume_return_x_bband_width_20h', 'lag_6h_volume_return_x_rolling_std_168h', 'lag_6h_volume_return_x_rolling_std_48h', 'lag_6h_volume_return_x_rolling_std_6h', 'lag_6h_volume_return_x_std12_div_std72', 'lag_72h_price_return', 'lag_72h_price_return_x_Volume BTC', 'lag_72h_price_return_x_bband_width_20h', 'lag_72h_price_return_x_lag_24h_volume_return', 'lag_72h_price_return_x_rolling_std_48h', 'lag_72h_price_return_x_rolling_std_6h', 'lag_72h_price_return_x_std12_div_std72', 'lag_72h_price_return_x_volume_div_ma_24h', 'lag_72h_price_return_x_volume_ma_12h', 'lag_72h_price_return_x_volume_return_1h', 'macd_hist', 'macd_hist_x_bband_width_20h', 'macd_hist_x_lag_12h_volume_return', 'macd_hist_x_lag_24h_volume_return', 'macd_hist_x_lag_3h_volume_return', 'macd_hist_x_lag_6h_volume_return', 
'macd_hist_x_rolling_std_168h', 'macd_hist_x_rolling_std_48h', 'macd_hist_x_std12_div_std72', 'macd_hist_x_volume_ma_168h', 'macd_hist_x_volume_return_1h', 'macd_signal', 'macd_signal_x_Volume BTC', 'macd_signal_x_bband_width_20h', 'macd_signal_x_lag_12h_volume_return', 'macd_signal_x_lag_3h_volume_return', 'macd_signal_x_lag_6h_volume_return', 'macd_signal_x_rolling_std_168h', 'macd_signal_x_volume_ma_168h', 'macd_signal_x_volume_return_1h', 'rolling_std_168h_log1p', 'rolling_std_168h_sqrt', 'rolling_std_3h_sq_sqrt', 'rolling_std_48h', 'rolling_std_48h_log1p', 'rolling_std_48h_sqrt', 'rolling_std_6h', 'rolling_std_6h_log1p', 'rolling_std_6h_sqrt', 'volume_div_ma_24h', 'volume_div_ma_24h_x_bband_width_20h', 'volume_div_ma_24h_x_rolling_std_48h', 'volume_div_ma_24h_x_rolling_std_6h', 'volume_div_ma_24h_x_std12_div_std72', 'volume_ma_12h', 'volume_ma_12h_log1p', 'volume_ma_12h_x_rolling_std_168h', 'volume_ma_12h_x_rolling_std_48h', 'volume_ma_12h_x_std12_div_std72', 'volume_ma_168h', 'volume_ma_168h_log1p', 'volume_ma_168h_x_bband_width_20h', 'volume_ma_168h_x_rolling_std_168h', 'volume_ma_168h_x_rolling_std_6h', 'volume_return_1h_sq', 'volume_return_1h_x_bband_width_20h', 'volume_return_1h_x_rolling_std_168h', 'volume_return_1h_x_rolling_std_48h', 'volume_return_1h_x_rolling_std_6h', 'volume_return_1h_x_std12_div_std72']
Features remaining after VIF check: 123
--- VIF calculation finished. ---

--- 9. Final Selection Summary (Post-Advanced Engineering) ---
Features after initial VIF:              58
Features after advanced eng+transform: 271
Features after final VIF (6.9):      123

Final list of selected feature names (after second VIF pass):
  1. Volume BTC_x_rolling_std_168h
  2. Volume BTC_x_std12_div_std72
  3. Volume USD
  4. cci_20h_sq
  5. cci_20h_x_Volume BTC
  6. cci_20h_x_cmf_20h
  7. cci_20h_x_lag_24h_volume_return
  8. cci_20h_x_lag_3h_volume_return
  9. cci_20h_x_rolling_kurt_24h
  10. cci_20h_x_rolling_std_168h
  11. cci_20h_x_rolling_std_48h
  12. cci_20h_x_std12_div_std72
  13. cci_20h_x_volume_div_ma_24h
  14. cci_20h_x_volume_ma_168h
  15. close_pos_in_range
  16. cmf_20h
  17. cmf_20h_x_bband_width_20h
  18. cmf_20h_x_rolling_kurt_24h
  19. cmf_20h_x_rolling_std_168h
  20. cmf_20h_x_rolling_std_6h
  21. cmf_20h_x_std12_div_std72
  22. day_0
  23. day_1
  24. day_2
  25. day_4
  26. day_5
  27. day_6
  28. hour_0
  29. hour_1
  30. hour_10
  31. hour_11
  32. hour_12
  33. hour_13
  34. hour_14
  35. hour_15
  36. hour_16
  37. hour_17
  38. hour_18
  39. hour_19
  40. hour_2
  41. hour_20
  42. hour_21
  43. hour_22
  44. hour_23
  45. hour_3
  46. hour_4
  47. hour_5
  48. hour_6
  49. hour_8
  50. hour_9
  51. lag_12h_price_return_sq
  52. lag_12h_price_return_x_cmf_20h
  53. lag_12h_price_return_x_rolling_kurt_24h
  54. lag_12h_price_return_x_rolling_std_168h
  55. lag_12h_price_return_x_volume_ma_168h
  56. lag_12h_volume_return
  57. lag_168h_price_return_sq
  58. lag_168h_price_return_x_Volume BTC
  59. lag_168h_price_return_x_bband_width_20h
  60. lag_168h_price_return_x_cmf_20h
  61. lag_168h_price_return_x_rolling_kurt_24h
  62. lag_168h_price_return_x_rolling_std_168h
  63. lag_168h_price_return_x_rolling_std_6h
  64. lag_168h_price_return_x_std12_div_std72
  65. lag_168h_price_return_x_volume_div_ma_24h
  66. lag_168h_price_return_x_volume_ma_168h
  67. lag_24h_price_return_sq
  68. lag_24h_price_return_x_lag_12h_volume_return
  69. lag_24h_price_return_x_lag_6h_volume_return
  70. lag_24h_price_return_x_rolling_std_168h
  71. lag_24h_price_return_x_volume_ma_168h
  72. lag_24h_volume_return_x_rolling_std_168h
  73. lag_24h_volume_return_x_std12_div_std72
  74. lag_3h_volume_return_x_rolling_kurt_24h
  75. lag_48h_price_return_sq
  76. lag_48h_price_return_x_lag_24h_volume_return
  77. lag_48h_price_return_x_rolling_std_168h
  78. lag_48h_price_return_x_volume_div_ma_24h
  79. lag_48h_price_return_x_volume_ma_168h
  80. lag_6h_volume_return_x_rolling_kurt_24h
  81. lag_72h_price_return_sq
  82. lag_72h_price_return_x_cmf_20h
  83. lag_72h_price_return_x_lag_12h_volume_return
  84. lag_72h_price_return_x_lag_3h_volume_return
  85. lag_72h_price_return_x_lag_6h_volume_return
  86. lag_72h_price_return_x_rolling_kurt_24h
  87. lag_72h_price_return_x_rolling_std_168h
  88. lag_72h_price_return_x_volume_ma_168h
  89. macd_hist_sq
  90. macd_hist_x_Volume BTC
  91. macd_hist_x_cmf_20h
  92. macd_hist_x_rolling_kurt_24h
  93. macd_hist_x_rolling_std_6h
  94. macd_hist_x_volume_div_ma_24h
  95. macd_hist_x_volume_ma_12h
  96. macd_signal_sq
  97. macd_signal_x_cmf_20h
  98. macd_signal_x_lag_24h_volume_return
  99. macd_signal_x_rolling_kurt_24h
  100. macd_signal_x_rolling_std_48h
  101. macd_signal_x_rolling_std_6h
  102. macd_signal_x_std12_div_std72
  103. macd_signal_x_volume_div_ma_24h
  104. macd_signal_x_volume_ma_12h
  105. rolling_kurt_24h
  106. rolling_skew_24h
  107. rolling_std_168h
  108. rolling_std_3h_sq
  109. rolling_std_6h_div_rolling_std_48h
  110. std12_div_std72
  111. volume_btc_x_range
  112. volume_btc_x_range_log1p
  113. volume_div_ma_24h_sq
  114. volume_div_ma_24h_x_rolling_kurt_24h
  115. volume_div_ma_24h_x_rolling_std_168h
  116. volume_ma_12h_x_bband_width_20h
  117. volume_ma_12h_x_rolling_kurt_24h
  118. volume_ma_12h_x_rolling_std_6h
  119. volume_ma_168h_x_rolling_kurt_24h
  120. volume_ma_168h_x_rolling_std_48h
  121. volume_ma_168h_x_std12_div_std72
  122. volume_return_1h
  123. volume_return_1h_x_rolling_kurt_24h

Total final selected features: 123

Advanced feature engineering and selection script finished.
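
The drop log above contains many `_log1p` and `_sqrt` variants. One plausible reason is that a monotonic transform of a positive feature stays almost linearly related to the original, so adding it alongside the raw column inflates VIF for the whole group. The sketch below checks this on synthetic data only. It computes VIF as the diagonal of the inverse correlation matrix, which is a standard identity and not the notebook's `calculate_vif` helper. The lognormal "volume" series and all variable names are assumptions.

```python
import numpy as np

def vif_from_correlation(X: np.ndarray) -> np.ndarray:
    """VIF per column, taken as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(42)
n = 5000
volume = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # positive, right-skewed stand-in for a volume feature
independent = rng.standard_normal(n)                 # unrelated feature for comparison

base = np.column_stack([volume, independent])
expanded = np.column_stack([volume, np.log1p(volume), np.sqrt(volume), independent])

vif_base = vif_from_correlation(base)
vif_expanded = vif_from_correlation(expanded)

print("base     [volume, independent]:", np.round(vif_base, 2))
print("expanded [volume, log1p, sqrt, independent]:", np.round(vif_expanded, 2))
```

In this toy setup the raw column and its transforms all receive high VIFs while the unrelated column stays near 1. That matches the pattern in the log, where either the raw feature or its transform tends to be removed.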