# Here is the refactored solution. I have separated the concerns into three distinct layers:
1.  **The Data Contract:** explicit `dataclasses` defining exactly what goes in and comes out.
2.  **The Engine:** A purely mathematical class (`AlphaEngine`) containing the logic, with no widget/plotting dependencies.
3.  **The UI:** A cleaned-up dashboard function that simply sends inputs to the Engine and visualizes the Output

### Following is the reverse chronological fix log (most recent entry is at the top )

```
To see the full dataframe of all tickers (both those that passed and those that failed) for a specific date, we need to capture a snapshot of the universe inside the `_get_eligible_universe` method.

I have updated the **`AlphaEngine`** class below.
```

```
To verify that the relative percentile logic is working, we can modify the `AlphaEngine` to report exactly **how the cutoff was calculated** for the specific start date.

We want to see evidence that:
1.  In earlier years (e.g., 2005), the volume cutoff is lower (e.g., $200k).
2.  In later years (e.g., 2024), the volume cutoff is higher (e.g., $5M).

Here is the updated `AlphaEngine` and `UI` code. I have added a **"Audit Log"** feature. When you run the tool, it will now print exactly what the Dollar Volume Threshold was for that specific day.
```

```
The best way to solve this is to switch from a **Fixed Dollar Threshold** (e.g., "$1 Million") to a **Relative Percentile Threshold** (e.g., "Top 50% of the market").

In 2004, a stock trading $200k might have been in the top 50% of liquid stocks. In 2024, that same $200k is illiquid garbage. Using a percentile automatically adjusts for inflation and market growth over time.

Here is how to modify your code to support this.
```

```
To fix this, we need to pass the **actual** calculated start date (the trading day the engine "snapped" to) back from the `AlphaEngine` to the UI. Then, the UI can compare the *Requested Date* vs. the *Actual Date* and display the warning message if they differ.

Here is the plan:
1.  **Update `EngineOutput`**: Add a `start_date` field to the dataclass.
2.  **Update `AlphaEngine.run`**: Populate this new field with `safe_start_date`.
3.  **Update `plot_walk_forward_analyzer`**: Add logic to compare the user's input date with the engine's returned date and print the "Info" message if they are different.

Here is the updated code (Sections C, D, and E have changed):
```

```
I have updated the `AlphaEngine.run` method. specifically inside the `if inputs.mode == 'Manual List':` block. It now iterates through every manual ticker and performs two checks:
1.  **Existence**: Is the ticker in the database?
2.  **Availability**: Does the ticker have a valid price on the specific `Start Date`?

If any ticker fails, it compiles a specific error message explaining why (e.g., "No price data on start date") and aborts the calculation immediately.  
```

```
The `snapshot_df` contains **every single feature** calculated by your `generate_features` function for that specific day, plus the new audit columns we added.

Here is exactly what is inside that DataFrame:

### 1. The Core Features (from `generate_features`)
*   **`TR`**: True Range
*   **`ATR`**: Average True Range
*   **`ATRP`**: Average True Range Percent (Volatility)
*   **`RollingStalePct`**: How often the price didn't move or volume was 0.
*   **`RollMedDollarVol`**: Median Daily Dollar Volume (Liquidity).
*   **`RollingSameVolCount`**: Data quality check for repeated volume numbers.

### 2. The Audit Columns (Added during filtering)
*   **`Calculated_Cutoff`**: The specific dollar amount required to pass on that day.
*   **`Passed_Vol_Check`**: `True` if the ticker met the liquidity requirement.
*   **`Passed_Final`**: `True` if it passed **all** checks (Liquidity + Stale + Quality).

=========================================

Here are the formulas translated directly into the Python `pandas` code used in your `generate_features` function.

I have simplified the code slightly to assume a single ticker context (removing the `groupby` wrapper) so you can see the raw math clearly.

### 1. True Range (TR)
Calculates the maximum of the three price differences.

prev_close = df_ohlcv['Adj Close'].shift(1)

# The three components
diff1 = df_ohlcv['Adj High'] - df_ohlcv['Adj Low']
diff2 = (df_ohlcv['Adj High'] - prev_close).abs()
diff3 = (df_ohlcv['Adj Low'] - prev_close).abs()

# Taking the max of the three
tr = pd.concat([diff1, diff2, diff3], axis=1).max(axis=1)

### 2. Average True Range (ATR)
Uses an Exponential Weighted Mean (EWM) with a specific alpha smoothing factor.

# N = atr_period (e.g., 14)
# alpha = 1 / N
atr = tr.ewm(alpha=1/14, adjust=False).mean()

### 3. ATR Percent (ATRP)
Simple division to normalize volatility.

atrp = atr / df_ohlcv['Adj Close']

### 4. Rolling Stale Percentage
Checks if volume is 0 OR if High equals Low (price didn't move), then averages that 1 or 0 signal over the window.

# 1. Define the Stale Signal (1 for stale, 0 for active)
is_stale = np.where(
    (df_ohlcv['Volume'] == 0) | (df_ohlcv['Adj High'] == df_ohlcv['Adj Low']), 
    1,  
    0
)

# 2. Calculate average over window (W=252)
rolling_stale_pct = pd.Series(is_stale).rolling(window=252).mean()

### 5. Rolling Median Dollar Volume
Calculates raw dollar volume, then finds the median over the window.

# 1. Calculate Daily Dollar Volume
dollar_volume = df_ohlcv['Adj Close'] * df_ohlcv['Volume']

# 2. Get Median over window (W=252)
roll_med_dollar_vol = dollar_volume.rolling(window=252).median()

### 6. Rolling Same Volume Count
Checks if today's volume is exactly the same as yesterday's (a sign of bad data), then sums those occurrences.

# 1. Check if Volume(t) - Volume(t-1) equals 0
# .diff() calculates current row minus previous row
has_same_volume = (df_ohlcv['Volume'].diff() == 0).astype(int)

# 2. Sum the errors over window (W=252)
rolling_same_vol_count = has_same_volume.rolling(window=252).sum()

```

================================  

In [9]:
df_ohlcv.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9657171 entries, ('A', Timestamp('1999-11-18 00:00:00')) to ('ZWS', Timestamp('2025-12-03 00:00:00'))
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Adj Open   float64
 1   Adj High   float64
 2   Adj Low    float64
 3   Adj Close  float64
 4   Volume     int64  
dtypes: float64(4), int64(1)
memory usage: 405.9+ MB


In [37]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import ipywidgets as widgets
import pprint

from dataclasses import dataclass, field
from typing import List, Dict, Optional, Any, Union
from collections import Counter
from datetime import datetime, date

# pd.set_option('display.max_rows', None)  display all rows
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 50)
pd.set_option('display.precision', 4)


# ==============================================================================
# SECTION A: CORE HELPER FUNCTIONS & FEATURE GENERATION
# (Unchanged from previous version)
# ==============================================================================
# ... (Keep generate_features, calculate_gain, calculate_sharpe, 
#      calculate_sharpe_atr, calculate_buy_and_hold_performance as is) ...

def generate_features(df_ohlcv: pd.DataFrame, atr_period: int = 14, quality_window: int = 252, quality_min_periods: int = 126) -> pd.DataFrame:
    # (Same as before)
    if not df_ohlcv.index.is_monotonic_increasing: df_ohlcv = df_ohlcv.sort_index()
    grouped = df_ohlcv.groupby(level='Ticker')
    prev_close = grouped['Adj Close'].shift(1)
    tr = pd.concat([df_ohlcv['Adj High'] - df_ohlcv['Adj Low'], abs(df_ohlcv['Adj High'] - prev_close), abs(df_ohlcv['Adj Low'] - prev_close)], axis=1).max(axis=1, skipna=False)
    atr = tr.groupby(level='Ticker').transform(lambda x: x.ewm(alpha=1/atr_period, adjust=False).mean())
    atrp = (atr / df_ohlcv['Adj Close']).replace([np.inf, -np.inf], np.nan)
    indicator_df = pd.DataFrame({'TR': tr, 'ATR': atr, 'ATRP': atrp})
    quality_temp_df = pd.DataFrame({'IsStale': np.where((df_ohlcv['Volume'] == 0) | (df_ohlcv['Adj High'] == df_ohlcv['Adj Low']), 1, 0), 'DollarVolume': df_ohlcv['Adj Close'] * df_ohlcv['Volume'], 'HasSameVolume': (grouped['Volume'].diff() == 0).astype(int)}, index=df_ohlcv.index)
    rolling_result = quality_temp_df.groupby(level='Ticker').rolling(window=quality_window, min_periods=quality_min_periods).agg({'IsStale': 'mean', 'DollarVolume': 'median', 'HasSameVolume': 'sum'}).rename(columns={'IsStale': 'RollingStalePct', 'DollarVolume': 'RollMedDollarVol', 'HasSameVolume': 'RollingSameVolCount'}).reset_index(level=0, drop=True)
    return pd.concat([indicator_df, rolling_result], axis=1)

def calculate_gain(price_series): 
    if price_series.dropna().shape[0] < 2: return np.nan
    return (price_series.ffill().iloc[-1] / price_series.bfill().iloc[0]) - 1

def calculate_sharpe(return_series):
    if return_series.dropna().shape[0] < 2: return np.nan
    std = return_series.std()
    return (return_series.mean() / std * np.sqrt(252)) if std > 0 else 0.0

def calculate_sharpe_atr(return_series, atrp_series):
    if return_series.dropna().shape[0] < 2 or atrp_series.dropna().empty: return np.nan
    mean_atrp = atrp_series.mean()
    return (return_series.mean() / mean_atrp) if mean_atrp > 0 else 0.0

def calculate_buy_and_hold_performance(df_close, features_df, tickers, start_date, end_date):
    if not tickers: return pd.Series(dtype=float), pd.Series(dtype=float), pd.Series(dtype=float)
    ticker_counts = Counter(tickers)
    initial_weights = pd.Series({t: c / len(tickers) for t, c in ticker_counts.items()})
    prices_raw = df_close[initial_weights.index.tolist()].loc[start_date:end_date]
    if prices_raw.dropna(how='all').empty: return pd.Series(dtype=float), pd.Series(dtype=float), pd.Series(dtype=float)
    prices_norm = prices_raw.div(prices_raw.bfill().iloc[0])
    weighted_growth = prices_norm.mul(initial_weights, axis='columns')
    value_series = weighted_growth.sum(axis=1)
    return_series = value_series.pct_change()
    full_idx = pd.MultiIndex.from_product([initial_weights.index.tolist(), return_series.index], names=['Ticker', 'Date'])
    feat_subset = features_df.reindex(full_idx)['ATRP'].unstack(level='Ticker')
    atrp_series = (weighted_growth.div(value_series, axis='index').align(feat_subset, join='inner', axis=1)[0] * weighted_growth.div(value_series, axis='index').align(feat_subset, join='inner', axis=1)[1]).sum(axis=1)
    return value_series, return_series, atrp_series

# ==============================================================================
# SECTION B: METRIC REGISTRY
# ==============================================================================

def metric_price(d): return calculate_gain(d['calc_close'])
def metric_sharpe(d): 
    r = d['daily_returns']
    return (r.mean() / r.std() * np.sqrt(252)).replace([np.inf, -np.inf], np.nan).fillna(0)
def metric_sharpe_atr(d):
    return (d['daily_returns'].mean() / d['atrp']).replace([np.inf, -np.inf], np.nan).fillna(0)

METRIC_REGISTRY = {
    'Price': metric_price,
    'Sharpe': metric_sharpe,
    'Sharpe (ATR)': metric_sharpe_atr,
}

# ==============================================================================
# SECTION C: DATA CONTRACTS (The API)
# Updated EngineOutput to include actual start_date
# ==============================================================================

@dataclass
class EngineInput:
    mode: str
    start_date: pd.Timestamp
    calc_period: int
    fwd_period: int
    metric: str
    benchmark_ticker: str
    rank_start: int = 1
    rank_end: int = 10
    quality_thresholds: Dict[str, float] = field(default_factory=lambda: {'min_median_dollar_volume': 1_000_000, 'max_stale_pct': 0.05, 'max_same_vol_count': 10})
    manual_tickers: List[str] = field(default_factory=list)
    debug: bool = False

@dataclass
class EngineOutput:
    portfolio_series: pd.Series
    benchmark_series: pd.Series
    normalized_plot_data: pd.DataFrame
    tickers: List[str]
    initial_weights: pd.Series
    perf_metrics: Dict[str, float]
    results_df: pd.DataFrame
    start_date: pd.Timestamp # <--- NEW FIELD: The actual trading start date used
    calc_end_date: pd.Timestamp
    viz_end_date: pd.Timestamp
    error_msg: Optional[str] = None
    debug_data: Optional[Dict[str, Any]] = None

# ==============================================================================
# SECTION D: THE ALPHA ENGINE (The "Brain")
# This version saves a sorted dataframe called `universe_snapshot` into the debug data. It adds columns showing exactly which tickers passed or failed the specific thresholds.
# ==============================================================================

class AlphaEngine:
    def __init__(self, df_ohlcv: pd.DataFrame, master_ticker: str = 'SPY'):
        print("--- ‚öôÔ∏è Initializing AlphaEngine ---")
        self.features_df = generate_features(df_ohlcv)
        print("Optimizing data structures...")
        self.df_close = df_ohlcv['Adj Close'].unstack(level=0)
        
        if master_ticker not in self.df_close.columns:
            master_ticker = self.df_close.columns[0]
            print(f"Warning: Master ticker not found. Using {master_ticker}")
            
        self.trading_calendar = self.df_close[master_ticker].dropna().index.unique().sort_values()
        print("‚úÖ AlphaEngine Ready.")

    def run(self, inputs: EngineInput) -> EngineOutput:
        # --- A. Validate Dates ---
        try:
            start_idx = self.trading_calendar.searchsorted(inputs.start_date)
            if start_idx < 0: start_idx = 0
        except Exception:
            return self._error_result("Invalid Start Date")

        desired_end_idx = start_idx + inputs.calc_period + inputs.fwd_period
        if desired_end_idx >= len(self.trading_calendar):
            return self._error_result(f"Date range exceeds history.")

        safe_start_date = self.trading_calendar[start_idx]
        safe_calc_end_date = self.trading_calendar[start_idx + inputs.calc_period]
        safe_viz_end_date = self.trading_calendar[start_idx + inputs.calc_period + inputs.fwd_period]

        # --- B. Select Tickers ---
        tickers_to_trade = []
        results_table = pd.DataFrame()
        debug_dict = {}
        audit_info = {} 

        if inputs.mode == 'Manual List':
            validation_errors = []
            valid_tickers = []
            for t in inputs.manual_tickers:
                if t not in self.df_close.columns:
                    validation_errors.append(f"‚ùå {t}: Ticker not found.")
                    continue
                if pd.isna(self.df_close.at[safe_start_date, t]):
                    validation_errors.append(f"‚ö†Ô∏è {t}: No price data on start date.")
                    continue
                valid_tickers.append(t)
            
            if validation_errors: return self._error_result("\n".join(validation_errors))
            if not valid_tickers: return self._error_result("No valid tickers.")
            tickers_to_trade = valid_tickers
            results_table = pd.DataFrame(index=valid_tickers)
            
        else: # Ranking Mode
            eligible_tickers = self._get_eligible_universe(safe_start_date, inputs.quality_thresholds, audit_info)
            debug_dict['audit_liquidity'] = audit_info 
            
            if not eligible_tickers: return self._error_result("No tickers passed quality filters.")
            
            calc_close = self.df_close.loc[safe_start_date:safe_calc_end_date, eligible_tickers]
            idx_product = pd.MultiIndex.from_product([eligible_tickers, calc_close.index], names=['Ticker', 'Date'])
            feat_slice = self.features_df.reindex(idx_product).dropna(how='all')
            atrp_mean = feat_slice.groupby(level='Ticker')['ATRP'].mean()
            
            ingredients = { 'calc_close': calc_close, 'daily_returns': calc_close.pct_change(), 'atrp': atrp_mean }
            if inputs.metric not in METRIC_REGISTRY: return self._error_result(f"Metric '{inputs.metric}' not found.")
            metric_vals = METRIC_REGISTRY[inputs.metric](ingredients)
            sorted_tickers = metric_vals.sort_values(ascending=False)
            
            start_r = max(0, inputs.rank_start - 1)
            end_r = inputs.rank_end
            tickers_to_trade = sorted_tickers.iloc[start_r:end_r].index.tolist()
            if not tickers_to_trade: return self._error_result("No tickers generated from ranking.")

            results_table = pd.DataFrame({
                'Rank': range(inputs.rank_start, inputs.rank_start + len(tickers_to_trade)),
                'Ticker': tickers_to_trade,
                'Metric Value': sorted_tickers.loc[tickers_to_trade].values
            }).set_index('Ticker')

        # --- C. Performance Calculations ---
        p_val, p_ret, p_atrp = calculate_buy_and_hold_performance(self.df_close, self.features_df, tickers_to_trade, safe_start_date, safe_viz_end_date)
        b_val, b_ret, b_atrp = calculate_buy_and_hold_performance(self.df_close, self.features_df, [inputs.benchmark_ticker], safe_start_date, safe_viz_end_date)

        # --- D. Final Metrics ---
        plot_data = self.df_close[list(set(tickers_to_trade))].loc[safe_start_date:safe_viz_end_date]
        if not plot_data.empty: plot_data = plot_data / plot_data.bfill().iloc[0]
        calc_end_ts = safe_calc_end_date
        metrics = {}
        get_gain = lambda s: (s.iloc[-1] / s.iloc[0]) - 1 if len(s) > 0 else 0

        metrics['full_p_gain'] = get_gain(p_val)
        metrics['calc_p_gain'] = get_gain(p_val.loc[:calc_end_ts])
        metrics['fwd_p_gain'] = get_gain(p_val.loc[calc_end_ts:])
        metrics['full_p_sharpe_atr'] = calculate_sharpe_atr(p_ret, p_atrp)
        metrics['calc_p_sharpe_atr'] = calculate_sharpe_atr(p_ret.loc[:calc_end_ts], p_atrp.loc[p_ret.loc[:calc_end_ts].index])
        metrics['fwd_p_sharpe_atr'] = calculate_sharpe_atr(p_ret.loc[calc_end_ts:].iloc[1:], p_atrp.loc[p_ret.loc[calc_end_ts:].iloc[1:].index])
        
        if not b_ret.empty:
            metrics['full_b_gain'] = get_gain(b_val)
            metrics['calc_b_gain'] = get_gain(b_val.loc[:calc_end_ts])
            metrics['fwd_b_gain'] = get_gain(b_val.loc[calc_end_ts:])
            metrics['full_b_sharpe_atr'] = calculate_sharpe_atr(b_ret, b_atrp)
            metrics['calc_b_sharpe_atr'] = calculate_sharpe_atr(b_ret.loc[:calc_end_ts], b_atrp.loc[b_ret.loc[:calc_end_ts].index])
            metrics['fwd_b_sharpe_atr'] = calculate_sharpe_atr(b_ret.loc[calc_end_ts:].iloc[1:], b_atrp.loc[b_ret.loc[calc_end_ts:].iloc[1:].index])

        if not plot_data.empty: results_table['Fwd Gain'] = (plot_data.iloc[-1] / plot_data.loc[calc_end_ts]) - 1
        ticker_counts = Counter(tickers_to_trade)
        weights = pd.Series({t: c/len(tickers_to_trade) for t, c in ticker_counts.items()})

        if inputs.debug:
            trace_df = plot_data.copy()
            trace_df.columns = [f'Norm_Price_{c}' for c in trace_df.columns]
            trace_df['Norm_Price_Portfolio'] = p_val
            if not b_val.empty: trace_df[f'Norm_Price_Benchmark_{inputs.benchmark_ticker}'] = b_val
            debug_dict['portfolio_trace'] = trace_df

        return EngineOutput(
            portfolio_series=p_val, benchmark_series=b_val, normalized_plot_data=plot_data,
            tickers=tickers_to_trade, initial_weights=weights, perf_metrics=metrics,
            results_df=results_table, start_date=safe_start_date,
            calc_end_date=safe_calc_end_date, viz_end_date=safe_viz_end_date, debug_data=debug_dict
        )

    # --- UPDATED: CAPTURE SNAPSHOT ---
    def _get_eligible_universe(self, date_ts, thresholds, audit_container=None):
        avail_dates = self.features_df.index.get_level_values('Date').unique().sort_values()
        valid_dates = avail_dates[avail_dates <= date_ts]
        if valid_dates.empty: return []
        day_features = self.features_df.xs(valid_dates[-1], level='Date')

        # 1. Determine Dynamic Cutoff
        vol_cutoff = thresholds.get('min_median_dollar_volume', 0)
        percentile_used = "N/A"
        dynamic_val = 0
        
        if 'min_liquidity_percentile' in thresholds:
            percentile_used = thresholds['min_liquidity_percentile']
            dynamic_val = day_features['RollMedDollarVol'].quantile(percentile_used)
            vol_cutoff = max(vol_cutoff, dynamic_val)

        # 2. Logic Mask
        mask = (
            (day_features['RollMedDollarVol'] >= vol_cutoff) &
            (day_features['RollingStalePct'] <= thresholds['max_stale_pct']) &
            (day_features['RollingSameVolCount'] <= thresholds['max_same_vol_count'])
        )

        # 3. Capture Detailed Audit Snapshot
        if audit_container is not None:
            audit_container['date'] = valid_dates[-1]
            audit_container['total_tickers_available'] = len(day_features)
            audit_container['percentile_setting'] = percentile_used
            audit_container['percentile_value_usd'] = dynamic_val
            audit_container['final_cutoff_usd'] = vol_cutoff
            audit_container['tickers_passed'] = mask.sum()
            
            # Save the DataFrame!
            snapshot = day_features.copy()
            snapshot['Calculated_Cutoff'] = vol_cutoff
            snapshot['Passed_Vol_Check'] = snapshot['RollMedDollarVol'] >= vol_cutoff
            snapshot['Passed_Final'] = mask
            # Sort by volume so user can see the cutoff point easily
            snapshot = snapshot.sort_values('RollMedDollarVol', ascending=False)
            audit_container['universe_snapshot'] = snapshot

        return day_features[mask].index.tolist()

    def _error_result(self, msg):
        return EngineOutput(pd.Series(dtype=float), pd.Series(dtype=float), pd.DataFrame(), [], pd.Series(dtype=float), {}, pd.DataFrame(), pd.Timestamp.min, pd.Timestamp.min, pd.Timestamp.min, msg)

# ==============================================================================
# SECTION E: THE UI (Visualization)
# Update this function to read the audit data from the `debug_data` and print it nicely.
# Updated print logic to detect date shift
# Fixed EngineInput argument mapping
# ==============================================================================

def plot_walk_forward_analyzer(df_ohlcv, 
                               default_start_date='2020-01-01', 
                               default_calc_period=126, 
                               default_fwd_period=63,
                               default_metric='Sharpe (ATR)', 
                               default_rank_start=1, 
                               default_rank_end=10,
                               default_benchmark_ticker='SPY', 
                               master_calendar_ticker='SPY', 
                               quality_thresholds=None, 
                               debug=False):
    
    engine = AlphaEngine(df_ohlcv, master_ticker=master_calendar_ticker)
    results_container = [None]
    debug_container = [None]

    # --- UPDATED DEFAULT SETTINGS WITH PERCENTILE ---
    if quality_thresholds is None:
        quality_thresholds = {
            'min_median_dollar_volume': 100_000, # Hard floor
            'min_liquidity_percentile': 0.50,    # Top 50%
            'max_stale_pct': 0.05, 
            'max_same_vol_count': 10
        }

    # (Widget setup code remains the same...)
    mode_selector = widgets.RadioButtons(options=['Ranking', 'Manual List'], value='Ranking', description='Portfolio Mode:', layout={'width': 'max-content'})
    start_date_picker = widgets.DatePicker(description='Start Date:', value=pd.to_datetime(default_start_date))
    calc_period_input = widgets.IntText(value=default_calc_period, description='Calc Period:')
    fwd_period_input = widgets.IntText(value=default_fwd_period, description='Fwd Period:')
    metric_dropdown = widgets.Dropdown(options=list(METRIC_REGISTRY.keys()), value=default_metric, description='Metric:')
    rank_start_input = widgets.IntText(value=default_rank_start, description='Rank Start:')
    rank_end_input = widgets.IntText(value=default_rank_end, description='Rank End:')
    manual_tickers_input = widgets.Textarea(value='', placeholder='Enter tickers...', description='Manual Tickers:', layout={'width': '400px', 'height': '80px'})
    benchmark_input = widgets.Text(value=default_benchmark_ticker, description='Benchmark:', placeholder='Enter Ticker')
    update_button = widgets.Button(description="Update Chart", button_style='primary')
    ticker_list_output = widgets.Output()

    ranking_controls = widgets.HBox([metric_dropdown, rank_start_input, rank_end_input])
    manual_controls = widgets.HBox([manual_tickers_input])
    date_controls = widgets.HBox([start_date_picker, calc_period_input, fwd_period_input])
    ui = widgets.VBox([mode_selector, date_controls, ranking_controls, manual_controls, widgets.HBox([benchmark_input, update_button]), ticker_list_output], layout=widgets.Layout(margin='10px 0 20px 0'))
    
    def on_mode_change(c):
        ranking_controls.layout.display = 'flex' if c['new'] == 'Ranking' else 'none'
        manual_controls.layout.display = 'none' if c['new'] == 'Ranking' else 'flex'
    mode_selector.observe(on_mode_change, names='value')
    on_mode_change({'new': mode_selector.value})

    fig = go.FigureWidget()
    fig.update_layout(title='Walk-Forward Performance Analysis', height=600, template="plotly_white", hovermode='x unified')
    for i in range(50): fig.add_trace(go.Scatter(visible=False, line=dict(width=2)))
    fig.add_trace(go.Scatter(name='Benchmark', visible=True, line=dict(color='black', width=3, dash='dash')))
    fig.add_trace(go.Scatter(name='Group Portfolio', visible=True, line=dict(color='green', width=3)))

    def update_plot(b):
        ticker_list_output.clear_output()
        manual_list = [t.strip().upper() for t in manual_tickers_input.value.split(',') if t.strip()]
        start_date_raw = pd.to_datetime(start_date_picker.value)
        
        if start_date_raw < (engine.trading_calendar[0] - pd.Timedelta(days=7)):
            with ticker_list_output: print(f"‚ö†Ô∏è DATE WARNING: Start date {start_date_raw.date()} is too early."); return

        inputs = EngineInput(
            mode=mode_selector.value,
            start_date=start_date_raw,
            calc_period=calc_period_input.value,
            fwd_period=fwd_period_input.value,
            metric=metric_dropdown.value,
            benchmark_ticker=benchmark_input.value.strip().upper(),
            rank_start=rank_start_input.value,
            rank_end=rank_end_input.value,
            quality_thresholds=quality_thresholds,
            manual_tickers=manual_list,
            debug=debug
        )
        
        with ticker_list_output:
            res = engine.run(inputs)
            results_container[0] = res
            debug_container[0] = res.debug_data
            if res.error_msg: print(res.error_msg); return

            with fig.batch_update():
                cols = res.normalized_plot_data.columns.tolist()
                for i in range(50):
                    if i < len(cols): fig.data[i].update(x=res.normalized_plot_data.index, y=res.normalized_plot_data[cols[i]], name=cols[i], visible=True)
                    else: fig.data[i].visible = False
                
                fig.data[50].update(x=res.benchmark_series.index, y=res.benchmark_series.values, name=f"Benchmark ({inputs.benchmark_ticker})", visible=not res.benchmark_series.empty)
                fig.data[51].update(x=res.portfolio_series.index, y=res.portfolio_series.values, visible=True)
                fig.layout.shapes = [dict(type="line", x0=res.calc_end_date, y0=0, x1=res.calc_end_date, y1=1, xref='x', yref='paper', line=dict(color="grey", width=2, dash="dash"))]

            req_date = inputs.start_date.date()
            act_date = res.start_date.date()
            if req_date != act_date: print(f"‚ÑπÔ∏è Info: Start date {req_date} is not a trading day. Snapping forward to {act_date}.")
            
            # --- LIQUIDITY AUDIT PRINT ---
            if inputs.mode == 'Ranking' and res.debug_data and 'audit_liquidity' in res.debug_data:
                audit = res.debug_data['audit_liquidity']
                if audit:
                    pct_str = f"{audit.get('percentile_setting', 0)*100:.0f}%"
                    cut_val = audit.get('final_cutoff_usd', 0)
                    print("-" * 60)
                    print(f"üîç LIQUIDITY CHECK ({act_date})")
                    print(f"   Universe Size: {audit.get('total_tickers_available')} tickers")
                    print(f"   Filtering: Top {pct_str} of Market")
                    print(f"   Calculated Cutoff: ${cut_val:,.0f} / day")
                    print(f"   Tickers Remaining: {audit.get('tickers_passed')}")
                    print("-" * 60)
            
            print(f"Analysis Period: {act_date} to {res.viz_end_date.date()}.")
            
            if inputs.mode == 'Ranking': print("Ranked Tickers:"); pprint.pprint(res.tickers)
            else: print("Manual Portfolio Tickers:"); pprint.pprint(res.tickers)
            
            m = res.perf_metrics
            rows = [
                {'Metric': 'Group Portfolio Gain', 'Full': m.get('full_p_gain'), 'Calc': m.get('calc_p_gain'), 'Fwd': m.get('fwd_p_gain')},
                {'Metric': f'Benchmark ({inputs.benchmark_ticker}) Gain', 'Full': m.get('full_b_gain'), 'Calc': m.get('calc_b_gain'), 'Fwd': m.get('fwd_b_gain')},
                {'Metric': '== Gain Delta', 'Full': m.get('full_p_gain',0)-m.get('full_b_gain',0), 'Calc': m.get('calc_p_gain',0)-m.get('calc_b_gain',0), 'Fwd': m.get('fwd_p_gain',0)-m.get('fwd_b_gain',0)},
                {'Metric': 'Group Sharpe (ATR)', 'Full': m.get('full_p_sharpe_atr'), 'Calc': m.get('calc_p_sharpe_atr'), 'Fwd': m.get('fwd_p_sharpe_atr')},
                {'Metric': f'Benchmark Sharpe (ATR)', 'Full': m.get('full_b_sharpe_atr'), 'Calc': m.get('calc_b_sharpe_atr'), 'Fwd': m.get('fwd_b_sharpe_atr')},
                {'Metric': '== Sharpe Delta', 'Full': m.get('full_p_sharpe_atr',0)-m.get('full_b_sharpe_atr',0), 'Calc': m.get('calc_p_sharpe_atr',0)-m.get('calc_b_sharpe_atr',0), 'Fwd': m.get('fwd_p_sharpe_atr',0)-m.get('fwd_b_sharpe_atr',0)}
            ]
            display(pd.DataFrame(rows).set_index('Metric').style.format("{:+.2%}", na_rep="N/A"))

    update_button.on_click(update_plot)
    update_plot(None)
    display(ui, fig)
    return results_container, debug_container

# ==============================================================================
# SECTION F: UTILITIES
# ==============================================================================

def print_nested(d, indent=0, width=4):
    """Pretty-print any nested dict/list/tuple combination."""
    spacing = ' ' * indent
    if isinstance(d, dict):
        for k, v in d.items():
            print(f'{spacing}{k}:')
            print_nested(v, indent + width, width)
    elif isinstance(d, (list, tuple)):
        for item in d:
            print_nested(item, indent, width)
    else:
        print(f'{spacing}{d}')

def get_ticker_OHLCV(
    df_ohlcv: pd.DataFrame,
    tickers: Union[str, List[str]],
    date_start: str,
    date_end: str,
    return_format: str = "dataframe",
    verbose: bool = True
) -> Union[pd.DataFrame, dict]:
    """
    Get OHLCV data for specified tickers within a date range.
    
    Parameters
    ----------
    df_ohlcv : pd.DataFrame
        DataFrame with MultiIndex of (ticker, date) and OHLCV columns
    tickers : str or list of str
        Ticker symbol(s) to retrieve
    date_start : str
        Start date in 'YYYY-MM-DD' format
    date_end : str
        End date in 'YYYY-MM-DD' format
    return_format : str, optional
        Format to return data in. Options: 
        - 'dataframe': Single DataFrame with MultiIndex (default)
        - 'dict': Dictionary with tickers as keys and DataFrames as values
        - 'separate': List of separate DataFrames for each ticker
    verbose : bool, optional
        Whether to print summary information (default: True)
    
    Returns
    -------
    Union[pd.DataFrame, dict, list]
        Filtered OHLCV data in specified format
    
    Raises
    ------
    ValueError
        If input parameters are invalid
    KeyError
        If tickers not found in DataFrame
    
    Examples
    --------
    >>> # Get data for single ticker
    >>> vlo_data = get_ticker_OHLCV(df_ohlcv, 'VLO', '2025-08-13', '2025-09-04')
    
    >>> # Get data for multiple tickers
    >>> multi_data = get_ticker_OHLCV(df_ohlcv, ['VLO', 'JPST'], '2025-08-13', '2025-09-04')
    
    >>> # Get data as dictionary
    >>> data_dict = get_ticker_OHLCV(df_ohlcv, ['VLO', 'JPST'], '2025-08-13', 
    ...                              '2025-09-04', return_format='dict')
    """
    
    # Input validation
    if not isinstance(df_ohlcv, pd.DataFrame):
        raise TypeError("df_ohlcv must be a pandas DataFrame")
    
    if not isinstance(df_ohlcv.index, pd.MultiIndex):
        raise ValueError("DataFrame must have MultiIndex of (ticker, date)")
    
    if len(df_ohlcv.index.levels) != 2:
        raise ValueError("MultiIndex must have exactly 2 levels: (ticker, date)")
    
    # Convert single ticker to list for consistent processing
    if isinstance(tickers, str):
        tickers = [tickers]
    elif not isinstance(tickers, list):
        raise TypeError("tickers must be a string or list of strings")
    
    # Convert dates to Timestamps
    try:
        start_date = pd.Timestamp(date_start)
        end_date = pd.Timestamp(date_end)
    except ValueError as e:
        raise ValueError(f"Invalid date format. Use 'YYYY-MM-DD': {e}")
    
    if start_date > end_date:
        raise ValueError("date_start must be before or equal to date_end")
    
    # Check if tickers exist in the DataFrame
    available_tickers = df_ohlcv.index.get_level_values(0).unique()
    missing_tickers = [t for t in tickers if t not in available_tickers]
    
    if missing_tickers:
        raise KeyError(f"Ticker(s) not found in DataFrame: {missing_tickers}")
    
    # Filter the data using MultiIndex slicing
    try:
        filtered_data = df_ohlcv.loc[(tickers, slice(date_start, date_end)), :]
    except Exception as e:
        raise ValueError(f"Error filtering data: {e}")
    
    # Handle empty results
    if filtered_data.empty:
        if verbose:
            print(f"No data found for tickers {tickers} in date range {date_start} to {date_end}")
        return filtered_data
    
    # Print summary if verbose
    if verbose:
        print(f"Data retrieved for {len(tickers)} ticker(s) from {date_start} to {date_end}")
        print(f"Total rows: {len(filtered_data)}")
        print(f"Date range in data: {filtered_data.index.get_level_values(1).min()} to "
              f"{filtered_data.index.get_level_values(1).max()}")
        
        # Print ticker-specific counts
        ticker_counts = filtered_data.index.get_level_values(0).value_counts()
        for ticker in tickers:
            count = ticker_counts.get(ticker, 0)
            if count > 0:
                print(f"  {ticker}: {count} rows")
            else:
                print(f"  {ticker}: No data in range")
    
    # Return in requested format
    if return_format == "dict":
        result = {}
        for ticker in tickers:
            try:
                result[ticker] = filtered_data.xs(ticker, level=0).loc[date_start:date_end]
            except KeyError:
                result[ticker] = pd.DataFrame()
        return result
    
    elif return_format == "separate":
        result = []
        for ticker in tickers:
            try:
                result.append(filtered_data.xs(ticker, level=0).loc[date_start:date_end])
            except KeyError:
                result.append(pd.DataFrame())
        return result
    
    elif return_format == "dataframe":
        return filtered_data
    
    else:
        raise ValueError(f"Invalid return_format: {return_format}. "
                         f"Must be 'dataframe', 'dict', or 'separate'")

def get_ticker_features(
    features_df: pd.DataFrame,
    tickers: Union[str, List[str]],
    date_start: str,
    date_end: str,
    return_format: str = "dataframe",
    verbose: bool = True
) -> Union[pd.DataFrame, dict]:
    """
    Get features data for specified tickers within a date range.
    
    Parameters
    ----------
    features_df : pd.DataFrame
        DataFrame with MultiIndex of (ticker, date) and feature columns
    tickers : str or list of str
        Ticker symbol(s) to retrieve
    date_start : str
        Start date in 'YYYY-MM-DD' format
    date_end : str
        End date in 'YYYY-MM-DD' format
    return_format : str, optional
        Format to return data in. Options: 
        - 'dataframe': Single DataFrame with MultiIndex (default)
        - 'dict': Dictionary with tickers as keys and DataFrames as values
        - 'separate': List of separate DataFrames for each ticker
    verbose : bool, optional
        Whether to print summary information (default: True)
    
    Returns
    -------
    Union[pd.DataFrame, dict, list]
        Filtered features data in specified format
    """
    # Convert single ticker to list for consistent processing
    if isinstance(tickers, str):
        tickers = [tickers]
    
    # Filter the data using MultiIndex slicing
    try:
        filtered_data = features_df.loc[(tickers, slice(date_start, date_end)), :]
    except Exception as e:
        if verbose:
            print(f"Error filtering data: {e}")
        return pd.DataFrame() if return_format == "dataframe" else {}
    
    # Handle empty results
    if filtered_data.empty:
        if verbose:
            print(f"No data found for tickers {tickers} in date range {date_start} to {date_end}")
        return filtered_data
    
    # Print summary if verbose
    if verbose:
        print(f"Features data retrieved for {len(tickers)} ticker(s) from {date_start} to {date_end}")
        print(f"Total rows: {len(filtered_data)}")
        print(f"Date range in data: {filtered_data.index.get_level_values(1).min()} to "
              f"{filtered_data.index.get_level_values(1).max()}")
        print(f"Available features: {', '.join(filtered_data.columns.tolist())}")
        
        # Print ticker-specific counts
        ticker_counts = filtered_data.index.get_level_values(0).value_counts()
        for ticker in tickers:
            count = ticker_counts.get(ticker, 0)
            if count > 0:
                print(f"  {ticker}: {count} rows")
            else:
                print(f"  {ticker}: No data in range")
    
    # Return in requested format
    if return_format == "dict":
        result = {}
        for ticker in tickers:
            try:
                result[ticker] = filtered_data.xs(ticker, level=0).loc[date_start:date_end]
            except KeyError:
                result[ticker] = pd.DataFrame()
        return result
    
    elif return_format == "separate":
        result = []
        for ticker in tickers:
            try:
                result.append(filtered_data.xs(ticker, level=0).loc[date_start:date_end])
            except KeyError:
                result.append(pd.DataFrame())
        return result
    
    elif return_format == "dataframe":
        return filtered_data
    
    else:
        raise ValueError(f"Invalid return_format: {return_format}. "
                         f"Must be 'dataframe', 'dict', or 'separate'")

def create_combined_dict(
    df_ohlcv: pd.DataFrame,
    features_df: pd.DataFrame,
    tickers: Union[str, List[str]],
    date_start: str,
    date_end: str,
    verbose: bool = True
) -> dict:
    """
    Create a combined dictionary with both OHLCV and features data for each ticker.
    
    Parameters:
    -----------
    df_ohlcv : pd.DataFrame
        DataFrame with OHLCV data (MultiIndex: ticker, date)
    features_df : pd.DataFrame
        DataFrame with features data (MultiIndex: ticker, date)
    tickers : str or list of str
        Ticker symbol(s) to retrieve
    date_start : str
        Start date in 'YYYY-MM-DD' format
    date_end : str
        End date in 'YYYY-MM-DD' format
    verbose : bool, optional
        Whether to print progress information (default: True)
    
    Returns:
    --------
    dict
        Dictionary with tickers as keys and combined DataFrames (OHLCV + features) as values
    """
    # Convert single ticker to list
    if isinstance(tickers, str):
        tickers = [tickers]
    
    if verbose:
        print(f"Creating combined dictionary for {len(tickers)} ticker(s)")
        print(f"Date range: {date_start} to {date_end}")
        print("=" * 60)
    
    # Get OHLCV data as dictionary
    ohlcv_dict = get_ticker_OHLCV(
        df_ohlcv, tickers, date_start, date_end, 
        return_format='dict', verbose=verbose
    )
    
    # Get features data as dictionary
    features_dict = get_ticker_features(
        features_df, tickers, date_start, date_end,
        return_format='dict', verbose=verbose
    )
    
    # Create combined_dict
    combined_dict = {}
    
    for ticker in tickers:
        if verbose:
            print(f"\nProcessing {ticker}...")
        
        # Check if ticker exists in both dictionaries
        if ticker in ohlcv_dict and ticker in features_dict:
            ohlcv_data = ohlcv_dict[ticker]
            features_data = features_dict[ticker]
            
            # Check if both dataframes have data
            if not ohlcv_data.empty and not features_data.empty:
                # Combine OHLCV and features data
                # Note: Both dataframes have the same index (dates), so we can concatenate
                combined_df = pd.concat([ohlcv_data, features_data], axis=1)
                
                # Ensure proper index naming
                combined_df.index.name = 'Date'
                
                # Store in combined_dict
                combined_dict[ticker] = combined_df
                
                if verbose:
                    print(f"  ‚úì Successfully combined data")
                    print(f"  OHLCV shape: {ohlcv_data.shape}")
                    print(f"  Features shape: {features_data.shape}")
                    print(f"  Combined shape: {combined_df.shape}")
                    print(f"  Date range: {combined_df.index.min()} to {combined_df.index.max()}")
            else:
                if verbose:
                    print(f"  ‚úó Cannot combine: One or both dataframes are empty")
                    print(f"    OHLCV empty: {ohlcv_data.empty}")
                    print(f"    Features empty: {features_data.empty}")
                combined_dict[ticker] = pd.DataFrame()
        else:
            if verbose:
                print(f"  ‚úó Ticker not found in both dictionaries")
                if ticker not in ohlcv_dict:
                    print(f"    Not in OHLCV data")
                if ticker not in features_dict:
                    print(f"    Not in features data")
            combined_dict[ticker] = pd.DataFrame()
    
    # Print summary
    if verbose:
        print("\n" + "=" * 60)
        print("SUMMARY")
        print("=" * 60)
        print(f"Total tickers processed: {len(tickers)}")
        
        tickers_with_data = [ticker for ticker, df in combined_dict.items() if not df.empty]
        print(f"Tickers with combined data: {len(tickers_with_data)}")
        
        if tickers_with_data:
            print("\nTicker details:")
            for ticker in tickers_with_data:
                df = combined_dict[ticker]
                print(f"  {ticker}: {df.shape} - {df.index.min()} to {df.index.max()}")
                print(f"    Columns: {len(df.columns)}")
        
        empty_tickers = [ticker for ticker, df in combined_dict.items() if df.empty]
        if empty_tickers:
            print(f"\nTickers with no data: {', '.join(empty_tickers)}")
    
    return combined_dict


In [2]:
data_path = r'c:\Users\ping\Files_win10\python\py311\stocks\data\df_OHLCV_stocks_etfs.parquet'
df_ohlcv = pd.read_parquet(data_path, engine='pyarrow')
print(f'df_ohlcv.info():\n{df_ohlcv.info()}')
df_ohlcv

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 9657171 entries, ('A', Timestamp('1999-11-18 00:00:00')) to ('ZWS', Timestamp('2025-12-03 00:00:00'))
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Adj Open   float64
 1   Adj High   float64
 2   Adj Low    float64
 3   Adj Close  float64
 4   Volume     int64  
dtypes: float64(4), int64(1)
memory usage: 405.9+ MB
df_ohlcv.info():
None


Unnamed: 0_level_0,Unnamed: 1_level_0,Adj Open,Adj High,Adj Low,Adj Close,Volume
Ticker,Date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,1999-11-18,27.2452,29.9398,23.9519,26.3470,74716395
A,1999-11-19,25.7108,25.7482,23.8396,24.1764,18198354
A,1999-11-22,24.7378,26.3470,23.9893,26.3470,7857764
A,1999-11-23,25.4488,26.1225,23.9518,23.9518,7138322
A,1999-11-24,24.0267,25.1120,23.9518,24.5881,5785610
...,...,...,...,...,...,...
ZWS,2025-11-26,47.5400,48.7000,47.3000,48.1300,1154100
ZWS,2025-11-28,48.4600,48.4800,47.7000,47.7000,481400
ZWS,2025-12-01,47.1700,48.1800,47.1500,47.7400,608100
ZWS,2025-12-02,47.9800,48.3100,47.7100,47.8200,838200


In [None]:
results_container, debug_container = plot_walk_forward_analyzer(
    df_ohlcv=df_ohlcv,
    default_start_date='2025-08-13',
    default_calc_period=10,
    default_fwd_period=5,
    default_metric='Sharpe (ATR)',
    default_rank_start=15,
    default_rank_end=16,
    default_benchmark_ticker='VOO', 
    master_calendar_ticker='VOO',    
    quality_thresholds = { 
        'min_median_dollar_volume': 100_000, # A low "hard floor" to filter absolute errors/garbage
        # If min_liquidity_percentile is 0.8 (Top 20%), we want values > the 0.8 quantile.            
        'min_liquidity_percentile': 0.50,    # Dynamic: Only keep the top 50% of stocks by volume
        'max_stale_pct': 0.05, 
        'max_same_vol_count': 10
    },
    debug=True  # <-- Activate the new mode!
)

In [3]:
features_df = generate_features(df_ohlcv=df_ohlcv)

In [4]:
tickers = ['VOO', 'VLO', 'JPST']
date_start = '2025-08-13'
date_end = '2025-09-04'

In [7]:
# Create combined dictionary
combined_dict = create_combined_dict(
    df_ohlcv=df_ohlcv,
    features_df=features_df,
    tickers=tickers,
    date_start=date_start,
    date_end=date_end,
    verbose=False
)

In [8]:
combined_dict['JPST']

Unnamed: 0_level_0,Adj Open,Adj High,Adj Low,Adj Close,Volume,TR,ATR,ATRP,RollingStalePct,RollMedDollarVol,RollingSameVolCount
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2025-08-13,49.8859,49.8957,49.876,49.8859,3998082,0.0197,0.025613,0.000513,0.0,253494900.0,0.0
2025-08-14,49.8859,49.8918,49.876,49.876,3662718,0.0158,0.024912,0.000499,0.0,252192900.0,0.0
2025-08-15,49.9007,49.9056,49.8957,49.8957,3421622,0.0296,0.025247,0.000506,0.0,252192900.0,0.0
2025-08-18,49.9056,49.9056,49.8957,49.8957,4062923,0.0099,0.02415,0.000484,0.0,252192900.0,0.0
2025-08-19,49.9056,49.9253,49.9056,49.9253,3869010,0.0296,0.02454,0.000492,0.0,250684000.0,0.0
2025-08-20,49.9253,49.9352,49.9154,49.9352,4704730,0.0198,0.024201,0.000485,0.0,250684000.0,0.0
2025-08-21,49.9154,49.9352,49.9056,49.9056,3881288,0.0296,0.024587,0.000493,0.0,250684000.0,0.0
2025-08-22,49.9352,49.9647,49.9253,49.9647,4340447,0.0591,0.027052,0.000541,0.0,250684000.0,0.0
2025-08-25,49.9647,49.9647,49.9549,49.9549,4474897,0.0098,0.02582,0.000517,0.0,250684000.0,0.0
2025-08-26,49.9647,49.9746,49.9647,49.9746,4887785,0.0197,0.025383,0.000508,0.0,250684000.0,0.0


In [None]:
print_nested(results_container)
print('='*20)
print_nested(debug_container)

In [None]:
debug_container

In [None]:
my_tickers = ['SPY', 'AAPL', 'IWM', 'QQQ', 'META', 'EEM', 'BABA']
my_tickers = ['NTES', 'LII',]


In [None]:
# Assuming you have run the variables setup from the previous step
snapshot_df = debug_container[0]['audit_liquidity']['universe_snapshot']

if 'AAPL' in snapshot_df.index:
    display(snapshot_df.loc[['AAPL']])
else:
    print("AAPL was not present in the data for this date.")

In [None]:
snapshot_df.to_csv('./export_csv/snapshot_df.csv')
print(f"‚úÖ Snapshot exported to: ./export_csv/snapshot_df.csv")
print(f"   Shape: {snapshot_df.shape}")
print(f"   Columns: {list(snapshot_df.columns)}")

In [None]:
# 1. Access the data inside the container list
current_debug_data = debug_container[0]

# 2. Check if the audit data exists (it is created only in 'Ranking' mode)
if current_debug_data and 'audit_liquidity' in current_debug_data:
    audit = current_debug_data['audit_liquidity']
    snapshot_df = audit['universe_snapshot']
    
    print(f"üìÖ Date: {audit['date'].date()}")
    print(f"üí∞ Calculated Cutoff: ${audit['final_cutoff_usd']:,.0f}")
    print("-" * 30)

# 3. View the tickers right around the cutoff point
# Find the index where 'Passed_Vol_Check' switches from True to False
    try:
        # Get the integer location (iloc) of the last True value
        last_pass_iloc = np.where(snapshot_df['Passed_Vol_Check'])[0][-1]
        
        # Show 5 rows before and 5 rows after the cutoff
        start = max(0, last_pass_iloc - 5)
        end = min(len(snapshot_df), last_pass_iloc + 6)
        
        display(snapshot_df.iloc[start:end].style.format({
            'RollMedDollarVol': '${:,.0f}',
            'Calculated_Cutoff': '${:,.0f}',
            'RollingStalePct': '{:.1%}'
        }))
    except IndexError:
        print("Could not determine cutoff boundary (maybe all passed or all failed).")
        display(snapshot_df.head())
else:
    print("‚ö†Ô∏è No audit data found. Make sure you are in 'Ranking' mode and have clicked 'Update Chart'.")    

In [None]:
display(snapshot_df)

In [62]:
import pandas as pd
import numpy as np
import sys
import os

def test_true_range_calculation():
    """Test TR = max(High-Low, |High-PrevClose|, |Low-PrevClose|)"""
    print("Running test_true_range_calculation...")
    
    # 1. SETUP: Create test data for all three conditions
    test_data = {
        'Adj Open': [100, 105, 95, 98, 102],
        'Adj High': [105, 108, 97, 102, 105],
        'Adj Low': [95, 103, 93, 100, 98],
        'Adj Close': [100, 106, 96, 99, 103],
        'Volume': [1000, 1200, 800, 900, 1100]
    }
    
    index = pd.MultiIndex.from_tuples(
        [
            ('TEST', pd.Timestamp('2024-01-01')),  # Day 1: No previous close
            ('TEST', pd.Timestamp('2024-01-02')),  # Day 2: Prev close = 100, High=108, Low=103
            ('TEST', pd.Timestamp('2024-01-03')),  # Day 3: Prev close = 106, High=97, Low=93
            ('TEST', pd.Timestamp('2024-01-04')),  # Day 4: Prev close = 96, High=102, Low=100
            ('TEST', pd.Timestamp('2024-01-05')),  # Day 5: Prev close = 99, High=105, Low=98
        ],
        names=['Ticker', 'Date']
    )
    
    df_test = pd.DataFrame(test_data, index=index)

    print(f'test_true_range_df:\n{df_test}\n')
    
    # 2. EXECUTION: Run function with small window for testing
    result = generate_features(df_test, quality_window=5, quality_min_periods=2)
    
    print("\nGenerated DataFrame:")
    print(result)
    
    # 3. ASSERTIONS: Verify TR values for all conditions
    
    all_passed = True
    
    # Day 1: Should be NaN (no previous close)
    if pd.isna(result.loc[('TEST', '2024-01-01'), 'TR']):
        print("‚úì Day 1 TR is NaN (correct - no previous close)")
    else:
        print(f"‚úó Day 1 TR should be NaN but got: {result.loc[('TEST', '2024-01-01'), 'TR']}")
        all_passed = False
    
    # Day 2: Condition 1 - Previous Close (100) is BETWEEN today's High (108) and Low (103)
    # max(108-103=5, |108-100|=8, |103-100|=3) = 8
    expected_tr_day2 = 8.0
    actual_tr_day2 = result.loc[('TEST', '2024-01-02'), 'TR']
    if abs(actual_tr_day2 - expected_tr_day2) < 0.0001:
        print(f"‚úì Day 2 TR is {actual_tr_day2} (expected {expected_tr_day2}) - Previous Close BETWEEN High and Low")
    else:
        print(f"‚úó Day 2 TR should be {expected_tr_day2} but got {actual_tr_day2}")
        all_passed = False
    
    # Day 3: Condition 2 - Previous Close (106) is ABOVE today's High (97)
    # max(97-93=4, |97-106|=9, |93-106|=13) = 13
    expected_tr_day3 = 13.0
    actual_tr_day3 = result.loc[('TEST', '2024-01-03'), 'TR']
    if abs(actual_tr_day3 - expected_tr_day3) < 0.0001:
        print(f"‚úì Day 3 TR is {actual_tr_day3} (expected {expected_tr_day3}) - Previous Close ABOVE High")
    else:
        print(f"‚úó Day 3 TR should be {expected_tr_day3} but got {actual_tr_day3}")
        all_passed = False
    
    # Day 4: Condition 3 - Previous Close (96) is BELOW today's Low (100)
    # max(102-100=2, |102-96|=6, |100-96|=4) = 6
    expected_tr_day4 = 6.0
    actual_tr_day4 = result.loc[('TEST', '2024-01-04'), 'TR']
    if abs(actual_tr_day4 - expected_tr_day4) < 0.0001:
        print(f"‚úì Day 4 TR is {actual_tr_day4} (expected {expected_tr_day4}) - Previous Close BELOW Low")
    else:
        print(f"‚úó Day 4 TR should be {expected_tr_day4} but got {actual_tr_day4}")
        all_passed = False
    
    # Day 5: Additional test - Previous Close (99) is BETWEEN today's High (105) and Low (98)
    # max(105-98=7, |105-99|=6, |98-99|=1) = 7 (High-Low dominates)
    expected_tr_day5 = 7.0
    actual_tr_day5 = result.loc[('TEST', '2024-01-05'), 'TR']
    if abs(actual_tr_day5 - expected_tr_day5) < 0.0001:
        print(f"‚úì Day 5 TR is {actual_tr_day5} (expected {expected_tr_day5}) - Previous Close BETWEEN, High-Low dominates")
    else:
        print(f"‚úó Day 5 TR should be {expected_tr_day5} but got {actual_tr_day5}")
        all_passed = False
    
    if all_passed:
        print("\n‚úÖ All TR tests passed! All conditions covered:")
        print("   - Day 2: Previous Close (100) BETWEEN High (108) and Low (103)")
        print("   - Day 3: Previous Close (106) ABOVE High (97)")
        print("   - Day 4: Previous Close (96) BELOW Low (100)")
        print("   - Day 5: Previous Close (99) BETWEEN High (105) and Low (98)")
    else:
        print("\n‚ùå Some tests failed!")
    
    return all_passed

def test_atr_calculation():
    """Test ATR = EWMA of TR with alpha=1/period"""
    print("\n" + "="*50)
    print("Running test_atr_calculation...")
    
    # Test data with 5 days
    test_data = {
        'Adj Open': [100, 102, 103, 110, 108],
        'Adj High': [101, 103, 103, 112, 110],
        'Adj Low': [99, 101, 103, 108, 107],
        'Adj Close': [100, 102, 103, 111, 109],
        'Volume': [1000, 1000, 1000, 1000, 1000]  # All non-zero for simplicity
    }
    
    index = pd.MultiIndex.from_tuples(
        [('TEST', pd.Timestamp(f'2024-01-{i:02d}')) for i in range(1, 6)],
        names=['Ticker', 'Date']
    )
    
    df_test = pd.DataFrame(test_data, index=index)

    print(f'test_true_range_df:\n{df_test}\n')

    result = generate_features(df_test, atr_period=14)
    
    print("\nATR Calculation Results:")
    print(result[['TR', 'ATR', 'ATRP']])
    
    # Manual calculation from our earlier example
    # CORRECTED EXPECTED VALUES WITH MORE PRECISION
    expected_atr = {
        '2024-01-02': 3.0,
        '2024-01-03': 40/14,  # ‚âà 2.857142857142857
        '2024-01-04': 646/196,  # ‚âà 3.2959183673469388
        '2024-01-05': 9182/2744,  # ‚âà 3.3462099125364433
    }

    all_passed = True
    for date_str, expected in expected_atr.items():
        actual = result.loc[('TEST', pd.Timestamp(date_str)), 'ATR']
        if abs(actual - expected) < 0.0001:
            print(f"‚úì {date_str} ATR: {actual:.6f} ‚âà {expected:.6f}")
        else:
            print(f"‚úó {date_str} ATR: {actual:.6f} != {expected:.6f}")
            all_passed = False
    
    if all_passed:
        print("\n‚úÖ All ATR tests passed!")
    else:
        print("\n‚ùå Some ATR tests failed!")
    
    return all_passed

def test_is_stale_calculation():
    """Test IsStale = 1 when Volume=0 OR High=Low"""
    print("\n" + "="*50)
    print("Running test_is_stale_calculation...")
    
    test_data = {
        'Adj Open': [100, 102, 103, 104],
        'Adj High': [101, 103, 103, 105],  # Day 3: High=Low
        'Adj Low': [99, 101, 103, 104],
        'Adj Close': [100, 102, 103, 105],
        'Volume': [1000, 0, 500, 1000]  # Day 2: Volume=0
    }
    
    index = pd.MultiIndex.from_tuples(
        [('TEST', pd.Timestamp(f'2024-01-{i:02d}')) for i in range(1, 5)],
        names=['Ticker', 'Date']
    )
    
    df_test = pd.DataFrame(test_data, index=index)

    print(f'test_is_stale_df:\n{df_test}\n')

    # Create IsStale manually to verify
    is_stale_manual = np.where(
        (df_test['Volume'] == 0) | (df_test['Adj High'] == df_test['Adj Low']),
        1, 0
    )
    
    print("\nüìä Manual IsStale Calculation:")
    print("=" * 50)
    print("IsStale = 1 if EITHER condition is true:")
    print("  1. Volume == 0")
    print("  2. Adj High == Adj Low (no price movement)")
    print("Otherwise, IsStale = 0")
    print("=" * 50)
    
    # Create a temporary DataFrame to display the calculation clearly
    manual_calc_df = df_test.copy()
    manual_calc_df['IsStale_Manual'] = is_stale_manual
    manual_calc_df['Volume==0'] = manual_calc_df['Volume'] == 0
    manual_calc_df['High==Low'] = manual_calc_df['Adj High'] == manual_calc_df['Adj Low']
    
    print("\nCalculation details:")
    for idx, row in manual_calc_df.iterrows():
        ticker_date = f"{idx[0]}, {idx[1].strftime('%Y-%m-%d')}"
        conditions = []
        if row['Volume==0']:
            conditions.append("Volume=0")
        if row['High==Low']:
            conditions.append("High=Low")
        
        condition_str = " OR ".join(conditions) if conditions else "None (both False)"
        result = row['IsStale_Manual']
        
        print(f"  {ticker_date}:")
        print(f"    Volume={row['Volume']}, High={row['Adj High']}, Low={row['Adj Low']}")
        print(f"    Conditions met: {condition_str}")
        print(f"    ‚Üí IsStale = {result}")
        print()

    expected = [0, 1, 1, 0]  # Day 1: normal, Day 2: vol=0, Day 3: high=low, Day 4: normal
    
    print(f"\nManual IsStale calculation: {is_stale_manual}")
    print(f"Expected: {expected}")
    
    if list(is_stale_manual) == expected:
        print("‚úì IsStale calculation logic is correct")
        return True
    else:
        print(f"‚úó IsStale calculation failed. Got {is_stale_manual}, expected {expected}")
        return False

def test_multiple_tickers():
    """Test that calculations don't mix data between tickers"""
    print("\n" + "="*50)
    print("Running test_multiple_tickers...")
    
    test_data = {
        'Adj Open': [100, 102, 50, 51],
        'Adj High': [101, 103, 52, 53],
        'Adj Low': [99, 101, 48, 49],
        'Adj Close': [100, 102, 49, 52],
        'Volume': [1000, 1000, 2000, 2000]
    }
    
    index = pd.MultiIndex.from_tuples([
        ('A', pd.Timestamp('2024-01-01')),
        ('A', pd.Timestamp('2024-01-02')),
        ('B', pd.Timestamp('2024-01-01')),
        ('B', pd.Timestamp('2024-01-02')),
    ], names=['Ticker', 'Date'])
    
    df_test = pd.DataFrame(test_data, index=index)

    print(f'test_multiple_tickers_df:\n{df_test}\n')

    result = generate_features(df_test)
    
    print("\nMultiple Ticker Results:")
    print(result[['TR', 'ATR']])
    
    # Ticker A day 2 TR should use A day 1 close, not B day 1 close
    tr_a2 = result.loc[('A', '2024-01-02'), 'TR']
    expected_a2 = 3.0  # max(103-101=2, |103-100|=3, |101-100|=1) = 3
    
    tr_b2 = result.loc[('B', '2024-01-02'), 'TR']
    expected_b2 = 4.0  # max(53-49=4, |53-49|=4, |49-49|=0) = 4
    
    tests_passed = 0
    total_tests = 2
    
    if abs(tr_a2 - expected_a2) < 0.0001:
        print(f"‚úì Ticker A TR: {tr_a2} (expected {expected_a2})")
        tests_passed += 1
    else:
        print(f"‚úó Ticker A TR: {tr_a2} != {expected_a2}")
    
    if abs(tr_b2 - expected_b2) < 0.0001:
        print(f"‚úì Ticker B TR: {tr_b2} (expected {expected_b2})")
        tests_passed += 1
    else:
        print(f"‚úó Ticker B TR: {tr_b2} != {expected_b2}")
    
    if tests_passed == total_tests:
        print("‚úÖ Ticker separation test passed!")
        return True
    else:
        print(f"‚ùå Ticker separation test failed: {tests_passed}/{total_tests} passed")
        return False

def test_edge_cases():
    """Test edge cases like zero price, single row, etc."""
    print("\n" + "="*50)
    print("Running test_edge_cases...")
    
    all_passed = True
    
    # Test 1: Very low price (penny stock)
    print("\n1. Testing penny stock with low price...")
    test_data = {
        'Adj Open': [0.10, 0.11],
        'Adj High': [0.10, 0.11],
        'Adj Low': [0.10, 0.11],
        'Adj Close': [0.10, 0.11],
        'Volume': [1000, 1000]
    }
    
    index = pd.MultiIndex.from_tuples([
        ('PENNY', pd.Timestamp('2024-01-01')),
        ('PENNY', pd.Timestamp('2024-01-02')),
    ], names=['Ticker', 'Date'])
    
    df_penny = pd.DataFrame(test_data, index=index)

    print(f'df_penny_stock:\n{df_penny}\n')

    result = generate_features(df_penny)
    
    # Check ATRP is reasonable (not inf/nan)
    atrp_val = result.loc[('PENNY', '2024-01-02'), 'ATRP']
    if pd.isna(atrp_val) or np.isinf(atrp_val):
        print(f"‚úó Penny stock ATRP is {atrp_val} (should be finite)")
        all_passed = False
    else:
        print(f"‚úì Penny stock ATRP is {atrp_val:.4f}")
    
    # Test 2: Single row
    print("\n2. Testing single row data...")
    test_data_single = {
        'Adj Open': [100],
        'Adj High': [101],
        'Adj Low': [99],
        'Adj Close': [100],
        'Volume': [1000]
    }
    
    index_single = pd.MultiIndex.from_tuples(
        [('SINGLE', pd.Timestamp('2024-01-01'))],
        names=['Ticker', 'Date']
    )
    
    df_single = pd.DataFrame(test_data_single, index=index_single)

    print(f'df_single:\n{df_single}\n')

    result_single = generate_features(df_single, quality_window=3, quality_min_periods=2)
    
    # TR should be NaN (no previous close)
    if pd.isna(result_single.loc[('SINGLE', '2024-01-01'), 'TR']):
        print("‚úì Single row TR is NaN (correct)")
    else:
        print(f"‚úó Single row TR should be NaN but got {result_single.loc[('SINGLE', '2024-01-01'), 'TR']}")
        all_passed = False
    
    # Rolling metrics should be NaN with min_periods=2
    if pd.isna(result_single.loc[('SINGLE', '2024-01-01'), 'RollingStalePct']):
        print("‚úì Single row rolling metrics are NaN (correct - insufficient periods)")
    else:
        print(f"‚úó Rolling metrics should be NaN but got {result_single.loc[('SINGLE', '2024-01-01'), 'RollingStalePct']}")
        all_passed = False
    
    if all_passed:
        print("\n‚úÖ All edge case tests passed!")
    else:
        print("\n‚ùå Some edge case tests failed!")
    
    return all_passed


def run_all_tests():
    """Run all tests and provide summary"""
    print("="*60)
    print("STARTING ALL TESTS FOR generate_features()")
    print("="*60)
    
    test_results = {}
    
    # Run each test
    test_results['TR Calculation'] = test_true_range_calculation()
    test_results['ATR Calculation'] = test_atr_calculation()
    test_results['IsStale Calculation'] = test_is_stale_calculation()
    test_results['Multiple Tickers'] = test_multiple_tickers()
    test_results['Edge Cases'] = test_edge_cases()
    
    # Summary
    print("\n" + "="*60)
    print("TEST SUMMARY")
    print("="*60)
    
    passed = sum(test_results.values())
    total = len(test_results)
    
    for test_name, result in test_results.items():
        status = "‚úÖ PASS" if result else "‚ùå FAIL"
        print(f"{status}: {test_name}")
    
    print("\n" + "="*60)
    print(f"TOTAL: {passed}/{total} tests passed ({passed/total*100:.1f}%)")
    print("="*60)
    
    if passed == total:
        print("\nüéâ ALL TESTS PASSED! Your feature calculations are working correctly.")
        return True
    else:
        print(f"\n‚ö†Ô∏è  {total - passed} test(s) failed. Review the output above.")
        return False

run_all_tests()

STARTING ALL TESTS FOR generate_features()
Running test_true_range_calculation...
test_true_range_df:
                   Adj Open  Adj High  Adj Low  Adj Close  Volume
Ticker Date                                                      
TEST   2024-01-01       100       105       95        100    1000
       2024-01-02       105       108      103        106    1200
       2024-01-03        95        97       93         96     800
       2024-01-04        98       102      100         99     900
       2024-01-05       102       105       98        103    1100


Generated DataFrame:
                     TR     ATR    ATRP  RollingStalePct  RollMedDollarVol  RollingSameVolCount
Ticker Date                                                                                    
TEST   2024-01-01   NaN     NaN     NaN              NaN               NaN                  NaN
       2024-01-02   8.0  8.0000  0.0755              0.0          113600.0                  0.0
       2024-01-03  13.0  8.35

True

In [None]:
# Run the specific test
pytest test_features.py::test_true_range_calculation -v

<function __main__.test_true_range_calculation()>

In [None]:
================================  
================================  
================================  