# Moon Cycles Deep Search (v2): Model Bakeoff

Goal of this notebook (in one sentence):
- **Check if Moon-phase-only features contain any real predictive edge for BTC daily direction, or if it is basically random.**

Why we do this research:
- We want a *clean*, *honest* answer about the Moon signal by itself.
- If there is no edge here, we should not waste time building complex systems on top of it.

How we try to be honest (very important):
1. We use a strict **time split** (no shuffling). This reduces future leakage.
2. We test several model families (linear, trees, small neural net, XGBoost).
3. We tune parameters using **validation** only.
4. We keep **test** for the final check.
5. We report statistical checks (p-value vs 50% and Wilson 95% CI), because 52% can be noise.
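
A minimal sketch of the time split from step 1, assuming the rows are already sorted by date. The function name `time_split` and the 70/15/15 shares are hypothetical illustrations, not values taken from the project code.

```python
def time_split(rows, train_share=0.70, val_share=0.15):
    """Split chronologically ordered rows into train / validation / test.

    No shuffling: every validation row is later than every train row,
    and every test row is later than every validation row.
    """
    n = len(rows)
    n_train = int(n * train_share)
    n_val = int(n * val_share)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


# Toy example: day indices stand in for dated rows.
days = list(range(100))
train, val, test = time_split(days)
print(len(train), len(val), len(test))
```

Tuning (step 3) would only ever read `val`, and `test` would be opened once at the end (step 4).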

What "edge" means here:
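- If the test-accuracy CI includes 50% and the p-value is not small (e.g. > 0.05), the result is indistinguishable from chance.
- Recall-min (the lower of UP recall and DOWN recall) below 50% means the model does worse than a coin flip on at least one side.
- MCC near 0 means the predictions are almost uncorrelated with the true labels.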
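
The checks above can be sketched with the standard library only. This is an illustration under assumptions: the helper names are hypothetical, and the project's own `p_value_vs_random` and CI columns may be computed differently (for example with a normal approximation).

```python
import math


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def binom_p_value_vs_half(successes, n):
    """Exact two-sided binomial test against p = 0.5.

    The null distribution is symmetric, so we double the upper tail
    of the more extreme side and cap the result at 1.
    """
    k = max(successes, n - successes)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def recall_min_and_mcc(tp, fp, tn, fn):
    """Lower of the two class recalls, plus the Matthews correlation coefficient."""
    recall_up = tp / (tp + fn) if (tp + fn) else 0.0
    recall_down = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return min(recall_up, recall_down), mcc


# Toy example (made-up counts): 520 correct predictions out of 1000.
ci_low, ci_high = wilson_ci(520, 1000)
p_value = binom_p_value_vs_half(520, 1000)
print(f"CI95 = [{ci_low:.3f}, {ci_high:.3f}], p-value = {p_value:.3f}")
```

In this toy case the interval straddles 0.50 and the p-value is above 0.05, so a 52% accuracy on 1000 days would not count as evidence of an edge.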

Extra practical check:
- At the end we also run a **very simple trading simulation** from the model signals.
  This is NOT production-grade trading. It is just a sanity-check: "does the signal help at all?".


In [None]:
from pathlib import Path
import sys

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

PROJECT_ROOT = Path('/home/rut/ostrofun')
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from RESEARCH2.Moon_cycles.moon_data import (
    MoonLabelConfig,
    build_moon_phase_features,
    load_market_slice,
)
from RESEARCH2.Moon_cycles.bakeoff_utils import run_moon_model_bakeoff
from RESEARCH2.Moon_cycles.eval_visuals import VisualizationConfig, evaluate_with_visuals

from RESEARCH2.Moon_cycles.trading_utils import (
    TradingConfig,
    backtest_long_flat_signals,
    build_signal_from_proba,
    plot_backtest_price_and_equity,
    sweep_trading_params,
)

# Display options for wide result tables.
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 160)


In [None]:
# ------------------------------
# Research configuration block
# ------------------------------

START_DATE = '2017-11-01'
END_DATE = None
USE_CACHE = True
VERBOSE = True

# If True, we use a wider Gaussian grid.
# Why: a narrow grid can miss the "only" area where any weak edge exists.
# Note: widening the grid increases runtime. Cache helps a lot.
WIDE_GAUSS_GRID = True

# Label parameters that stay fixed while we tune Gaussian window/std.
# These labels try to create a balanced UP/DOWN set using detrended future returns.
LABEL_CFG = MoonLabelConfig(
    horizon=1,
    move_share=0.5,
    label_mode='balanced_detrended',
    price_mode='raw',
)

# Gaussian grid to tune (label detrending parameters).
# The idea: different smoothing strength changes what is considered a "real move".
if WIDE_GAUSS_GRID:
    # Wider range: more windows + more stds.
    GAUSS_WINDOWS = [51, 101, 151, 201, 251, 301, 351, 401]
    GAUSS_STDS = [10.0, 15.0, 20.0, 25.0, 30.0, 40.0, 50.0, 70.0, 90.0]
else:
    # Narrower range: faster debug runs.
    GAUSS_WINDOWS = [101, 151, 201, 251, 301]
    GAUSS_STDS = [30.0, 50.0, 70.0, 90.0]

# Threshold tuning penalties (helps avoid one-class prediction collapse).
# In plain words:
# - gap penalty punishes "UP recall is good but DOWN recall is terrible" (or vice versa)
# - prior penalty punishes predicting UP too often (or too rarely) vs the true UP share
THRESHOLD_GAP_PENALTY = 0.25
THRESHOLD_PRIOR_PENALTY = 0.05

# XGBoost params (kept as baseline).
XGB_PARAMS = {
    'n_estimators': 500,
    'max_depth': 6,
    'learning_rate': 0.03,
    'colsample_bytree': 0.8,
    'subsample': 0.8,
    'early_stopping_rounds': 50,
}

# Dark-theme visuals.
VIS_CFG = VisualizationConfig(
    rolling_window_days=90,
    rolling_min_periods=30,
    probability_bins=64,
)

print('Config loaded.')
print('WIDE_GAUSS_GRID =', WIDE_GAUSS_GRID)
print('Gaussian grid size =', len(GAUSS_WINDOWS) * len(GAUSS_STDS))


In [None]:
# -------------------------------------------
# Load market data and build Moon-only features
# -------------------------------------------

df_market = load_market_slice(
    start_date=START_DATE,
    end_date=END_DATE,
    use_cache=USE_CACHE,
    verbose=VERBOSE,
)

df_moon_features = build_moon_phase_features(
    df_market=df_market,
    use_cache=USE_CACHE,
    verbose=VERBOSE,
    progress=True,
)

print('Market rows:', len(df_market))
print('Moon feature rows:', len(df_moon_features))
print('Market range:', df_market['date'].min().date(), '->', df_market['date'].max().date())

display(df_moon_features.head(5))

## Bakeoff Run

This section runs a grid over Gaussian label parameters and evaluates multiple models.
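
To make the role of the Gaussian window and std concrete, here is one plausible reading of "detrended" direction labels. It is a sketch under assumptions only: we do not reproduce how `balanced_detrended` or `move_share` work in `MoonLabelConfig`, and the helper names below are hypothetical.

```python
import math


def gaussian_smooth(values, window, std):
    """Centered Gaussian-weighted moving average.

    Near the edges only the part of the kernel that overlaps the data is
    used, and the weights are renormalized.
    """
    half = window // 2
    offsets = range(-half, half + 1)
    kernel = [math.exp(-0.5 * (k / std) ** 2) for k in offsets]
    out = []
    for i in range(len(values)):
        num = 0.0
        den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(values):
                num += w * values[j]
                den += w
        out.append(num / den)
    return out


def detrended_direction_labels(prices, window, std, horizon=1):
    """Label 1 (UP) if the detrended log-price rises over `horizon` days, else 0."""
    log_p = [math.log(p) for p in prices]
    trend = gaussian_smooth(log_p, window, std)
    resid = [lp - t for lp, t in zip(log_p, trend)]
    return [
        1 if resid[i + horizon] > resid[i] else 0
        for i in range(len(resid) - horizon)
    ]


# Toy example with made-up prices.
prices = [100.0, 101.0, 99.5, 102.0, 103.0, 101.5, 104.0, 105.0, 103.5, 106.0]
labels = detrended_direction_labels(prices, window=5, std=1.5)
print(labels)
```

A larger window or std removes a slower trend, so the same price path can produce different UP/DOWN labels. Note that a centered window looks at future prices: that is acceptable for building targets, but such a series must never be used as a model feature.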

Important:
- We select the best Gaussian params **per model** using **validation metrics**.
- Only after that we look at test metrics.
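
The selection rule can be sketched as follows. The ranking order mirrors the sort used later in this notebook (highest `val_recall_min`, then lowest `val_recall_gap`, then highest `val_mcc` and `val_acc`); whether `run_moon_model_bakeoff` uses exactly this order internally is an assumption, and the rows below are made-up numbers.

```python
def pick_best_by_validation(rows):
    """Return {model: best row}, ranked by validation metrics only.

    Test metrics are carried along in the rows but never used for ranking.
    """
    def val_key(r):
        return (
            -r['val_recall_min'],  # higher is better
            r['val_recall_gap'],   # lower is better
            -r['val_mcc'],         # higher is better
            -r['val_acc'],         # higher is better
        )

    best = {}
    for r in sorted(rows, key=val_key):
        best.setdefault(r['model'], r)  # first row seen per model is its best
    return best


# Made-up example rows.
rows = [
    {'model': 'logreg', 'gauss_window': 101, 'gauss_std': 30.0,
     'val_recall_min': 0.48, 'val_recall_gap': 0.06, 'val_mcc': 0.01, 'val_acc': 0.51, 'test_acc': 0.55},
    {'model': 'logreg', 'gauss_window': 201, 'gauss_std': 50.0,
     'val_recall_min': 0.51, 'val_recall_gap': 0.03, 'val_mcc': 0.04, 'val_acc': 0.52, 'test_acc': 0.49},
    {'model': 'xgb', 'gauss_window': 151, 'gauss_std': 30.0,
     'val_recall_min': 0.50, 'val_recall_gap': 0.02, 'val_mcc': 0.02, 'val_acc': 0.51, 'test_acc': 0.50},
]
best = pick_best_by_validation(rows)
print({m: (r['gauss_window'], r['gauss_std']) for m, r in best.items()})
```

In the toy data the first `logreg` row has the better test accuracy but loses on validation, so it is not selected. That is the point: picking by test would turn the test set into a second validation set.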

In [None]:
bakeoff = run_moon_model_bakeoff(
    df_market=df_market,
    df_moon_features=df_moon_features,
    gauss_windows=GAUSS_WINDOWS,
    gauss_stds=GAUSS_STDS,
    label_cfg=LABEL_CFG,
    include_xgb=True,
    xgb_params=XGB_PARAMS,
    threshold_gap_penalty=THRESHOLD_GAP_PENALTY,
    threshold_prior_penalty=THRESHOLD_PRIOR_PENALTY,
    use_cache=USE_CACHE,
    verbose=VERBOSE,
)

results_table = bakeoff['results_table']
best_by_val = bakeoff['best_by_val_table']
best_runs = bakeoff['best_runs']

print('Bakeoff results rows:', len(results_table))
print('Best-by-validation table (one row per model):')
display(best_by_val)

In [None]:
# -------------------------------------------------
# Show top configs per model by VALIDATION quality
# -------------------------------------------------
# This makes it easy to see if any model consistently beats random.

show_cols = [
    'model',
    'gauss_window',
    'gauss_std',
    'val_recall_min',
    'val_recall_gap',
    'val_mcc',
    'val_acc',
    'test_recall_min',
    'test_recall_gap',
    'test_mcc',
    'test_acc',
    'p_value_vs_random',
    'baseline_majority_test_acc',
    'baseline_random_test_acc',
    'pred_up_share',
]

for model in sorted(results_table['model'].unique()):
    sub = results_table[results_table['model'] == model].copy()
    sub = sub.sort_values(
        ['val_recall_min', 'val_recall_gap', 'val_mcc', 'val_acc'],
        ascending=[False, True, False, False],
    )
    print()
    print('='*90)
    print('MODEL:', model)
    display(sub[show_cols].head(10))

## Extra sweep: threshold penalties (small grid)

Why we do this:
- The model outputs probabilities close to 0.50 in many cases.
- A bad threshold rule can collapse into "predict UP always" or "predict DOWN always".
- We already tune the threshold on validation, but the tuning objective has penalties.

So here we do a *small* grid over those penalty weights to check if the result is sensitive.

Important honesty note:
- We still treat TEST as a report-only set. We do not pick penalties by TEST.
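
For intuition, a penalized threshold objective of the kind described above could look like this. It is a hedged sketch: the exact formula inside the project's tuning code is not shown in this notebook, so the score definition, the helper names, and the candidate range 0.30 to 0.70 are all assumptions.

```python
def recalls_and_up_share(y_true, proba_up, threshold):
    """Per-class recalls and the share of UP predictions at a given threshold."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, proba_up):
        pred_up = p >= threshold
        if y == 1:
            if pred_up:
                tp += 1
            else:
                fn += 1
        else:
            if pred_up:
                fp += 1
            else:
                tn += 1
    recall_up = tp / (tp + fn) if (tp + fn) else 0.0
    recall_down = tn / (tn + fp) if (tn + fp) else 0.0
    pred_up_share = (tp + fp) / len(y_true)
    return recall_up, recall_down, pred_up_share


def threshold_score(y_true, proba_up, threshold, gap_penalty, prior_penalty):
    """Assumed objective: weakest recall minus the two penalties."""
    recall_up, recall_down, pred_up_share = recalls_and_up_share(y_true, proba_up, threshold)
    true_up_share = sum(y_true) / len(y_true)
    return (
        min(recall_up, recall_down)
        - gap_penalty * abs(recall_up - recall_down)
        - prior_penalty * abs(pred_up_share - true_up_share)
    )


def tune_threshold(y_true, proba_up, gap_penalty=0.25, prior_penalty=0.05):
    """Pick the best threshold on VALIDATION data from a fixed candidate grid."""
    candidates = [i / 100 for i in range(30, 71)]
    return max(
        candidates,
        key=lambda t: threshold_score(y_true, proba_up, t, gap_penalty, prior_penalty),
    )


# Made-up validation data with probabilities close to 0.50.
y_val = [0, 0, 0, 1, 1, 1]
p_val = [0.40, 0.42, 0.44, 0.56, 0.58, 0.60]
best_threshold = tune_threshold(y_val, p_val)
collapsed = threshold_score(y_val, p_val, 0.30, 0.25, 0.05)  # predicts UP always
balanced = threshold_score(y_val, p_val, best_threshold, 0.25, 0.05)
print(best_threshold, collapsed, balanced)
```

A threshold that predicts UP for every row gets a recall-min of 0 and is then pushed further down by both penalties, which is how the penalties discourage one-class collapse.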


In [None]:
# Small grid: (gap_penalty, prior_penalty)
# We keep it small on purpose to avoid huge multiple-comparison noise.
GAP_PENALTIES = [0.10, 0.25, 0.40]
PRIOR_PENALTIES = [0.00, 0.05, 0.10]

# We test penalties on the BEST gaussian configs we already found (one per model).
# This is much cheaper than re-running the full gaussian grid for every penalty pair.
best_gauss_pairs = (
    best_by_val[['gauss_window', 'gauss_std']]
    .drop_duplicates()
    .sort_values(['gauss_window', 'gauss_std'])
    .reset_index(drop=True)
)

penalty_rows = []

for _, r in best_gauss_pairs.iterrows():
    gw = int(r['gauss_window'])
    gs = float(r['gauss_std'])

    for gp in GAP_PENALTIES:
        for pp in PRIOR_PENALTIES:
            tmp = run_moon_model_bakeoff(
                df_market=df_market,
                df_moon_features=df_moon_features,
                gauss_windows=[gw],
                gauss_stds=[gs],
                label_cfg=LABEL_CFG,
                include_xgb=True,
                xgb_params=XGB_PARAMS,
                threshold_gap_penalty=float(gp),
                threshold_prior_penalty=float(pp),
                use_cache=USE_CACHE,
                verbose=VERBOSE,
            )

            t = tmp['best_by_val_table'].copy()
            t['gauss_window'] = gw
            t['gauss_std'] = gs
            t['gap_penalty'] = float(gp)
            t['prior_penalty'] = float(pp)
            penalty_rows.append(t)

df_penalties = pd.concat(penalty_rows, ignore_index=True)

# For inspection we sort by validation quality first (our honest selection rule).
df_penalties_sorted = df_penalties.sort_values(
    ['model', 'val_recall_min', 'val_recall_gap', 'val_mcc', 'val_acc'],
    ascending=[True, False, True, False, False],
)

display(
    df_penalties_sorted[[
        'model','gauss_window','gauss_std','gap_penalty','prior_penalty',
        'val_recall_min','val_recall_gap','val_mcc','val_acc',
        'test_recall_min','test_recall_gap','test_mcc','test_acc','p_value_vs_random',
    ]].head(30)
)


## Visual Diagnostics For Winners

For each model, we take its best-by-validation configuration and plot:
- confusion matrix
- predicted vs true label background over price
- rolling metrics
- probability histogram

This is the most "human readable" way to see if the model actually does something useful.
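
As an illustration of the rolling-metrics plot, here is a small sketch of rolling accuracy with a window and a minimum number of observations, analogous to `rolling_window_days` and `rolling_min_periods` in `VisualizationConfig`. How `evaluate_with_visuals` computes its rolling metrics is not shown here, so treat the function below as a hypothetical stand-in.

```python
def rolling_accuracy(y_true, y_pred, window=90, min_periods=30):
    """Share of correct predictions over the trailing `window` rows.

    Returns None for positions with fewer than `min_periods` rows available.
    """
    hits = [1 if t == p else 0 for t, p in zip(y_true, y_pred)]
    out = []
    running = 0
    for i, h in enumerate(hits):
        running += h
        if i >= window:
            running -= hits[i - window]  # drop the row that left the window
        n = min(i + 1, window)
        out.append(running / n if n >= min_periods else None)
    return out


# Toy example: the model is right for the first 50 days and wrong afterwards.
y_true = [1] * 100
y_pred = [1] * 50 + [0] * 50
roll = rolling_accuracy(y_true, y_pred, window=10, min_periods=5)
print(roll[4], roll[49], roll[54], roll[59])
```

If a model had a real edge, a curve like this would sit above 0.50 most of the time rather than wandering around it.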

In [None]:
for model_name, run in best_runs.items():
    print()
    print('#' * 100)
    print('WINNER MODEL:', model_name)

    pred = run['predictions'].copy()
    # Drop rows without a prediction first, then reset the index so it stays contiguous.
    test_df = (
        pred[pred['split_role'] == 'test']
        .dropna(subset=['pred_label'])
        .reset_index(drop=True)
    )

    # Evaluate baseline (majority) vs model on the SAME test period.
    _ = evaluate_with_visuals(
        df_plot=test_df,
        y_true=test_df['target'].to_numpy(dtype=np.int32),
        y_pred=test_df['baseline_majority'].to_numpy(dtype=np.int32),
        y_prob_up=None,
        title=f"{model_name.upper()} - BEFORE training (majority baseline)",
        vis_cfg=VIS_CFG,
        show_visuals=True,
    )

    _ = evaluate_with_visuals(
        df_plot=test_df,
        y_true=test_df['target'].to_numpy(dtype=np.int32),
        y_pred=test_df['pred_label'].to_numpy(dtype=np.int32),
        y_prob_up=test_df['pred_proba_up'].to_numpy(dtype=float),
        title=f"{model_name.upper()} - AFTER training (Moon-only)",
        vis_cfg=VIS_CFG,
        show_visuals=True,
    )

In [None]:
# -------------------------------------------------
# Quick conclusion helper
# -------------------------------------------------
# This is a simple text summary you can read like a report.

report_cols = [
    'model',
    'gauss_window',
    'gauss_std',
    'val_recall_min',
    'val_recall_gap',
    'val_mcc',
    'test_recall_min',
    'test_recall_gap',
    'test_mcc',
    'test_acc',
    'accuracy_ci95_low',
    'accuracy_ci95_high',
    'p_value_vs_random',
]

display(best_by_val[report_cols])

print()
print('Interpretation guide (very simple):')
print('- If the test_acc CI includes 0.50 and the p-value is not small (e.g. > 0.05), the result looks random.')
print('- If test_recall_min is below 0.50, the weaker class is predicted worse than a coin flip.')
print('- If MCC is near 0, predictions are almost uncorrelated with the true labels.')

## Trading sanity-check (very simple)

This section answers a very practical question:

- If we blindly **buy when the model says UP** and **sell when it says DOWN**,
  does the equity curve look better than Buy&Hold *on the same test period*?

We keep the trading model intentionally simple because the goal is research clarity, not perfection:
- 0.1% fee per buy and per sell (configurable)
- optional stop-loss (approximation with daily close only)
- optional 'no-signal' neutral zone based on probability thresholds

If even this simple test looks random, it is a strong sign that Moon-only features have no edge.
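
The core of such a long/flat simulation fits in a few lines. The sketch below is not the implementation in `trading_utils`; it assumes trades execute at the same day's close, that a missing signal (`None`) keeps the current position, and it leaves out the stop-loss. All names are hypothetical.

```python
def backtest_long_flat(closes, signals, fee_rate=0.001, initial_cash=1.0):
    """Simulate a long/flat strategy and return the daily equity curve.

    signals[i]: 1 = be long, 0 = be flat, None = no signal (keep position).
    A fee of `fee_rate` is charged on every buy and every sell.
    """
    cash = initial_cash
    units = 0.0
    equity = []
    for price, sig in zip(closes, signals):
        if sig == 1 and units == 0.0:
            units = cash * (1.0 - fee_rate) / price  # buy with all cash
            cash = 0.0
        elif sig == 0 and units > 0.0:
            cash = units * price * (1.0 - fee_rate)  # sell everything
            units = 0.0
        equity.append(cash + units * price)
    if units > 0.0:
        # Close the final position so the last equity value is in cash.
        equity[-1] = units * closes[-1] * (1.0 - fee_rate)
    return equity


def buy_and_hold(closes, fee_rate=0.001, initial_cash=1.0):
    """Benchmark: buy on the first day, sell on the last, same fee rules."""
    return backtest_long_flat(closes, [1] * len(closes), fee_rate, initial_cash)


# Toy example with made-up prices and signals.
closes = [100.0, 110.0, 99.0, 120.0]
signals = [1, 0, None, 1]
eq_model = backtest_long_flat(closes, signals)
eq_hold = buy_and_hold(closes)
print(eq_model[-1], eq_hold[-1])
```

Comparing the last values of the two curves on the same test period is the simplest form of the check. Because fees are paid on every switch, a signal that flips often has to be right noticeably more than half the time just to break even.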


In [None]:
# ------------------------------
# Trading backtest for each winner model (TEST period only)
# ------------------------------

FEE_RATE = 0.001  # 0.1%
STOP_LOSSES = [0.0, 0.02, 0.05, 0.08]  # 0%, 2%, 5%, 8%

# Neutral zone for 'no signal' example (optional).
# If prob is between [DOWN_TH, UP_TH], we set NaN => 'no-signal'.
UP_TH = 0.55
DOWN_TH = 0.45

for model_name, run in best_runs.items():
    print()
    print('#' * 100)
    print('TRADING CHECK FOR MODEL:', model_name)

    pred = run['predictions'].copy()
    # Drop incomplete rows first, then reset the index so it stays contiguous.
    test_df = (
        pred[pred['split_role'] == 'test']
        .dropna(subset=['pred_label', 'close'])
        .reset_index(drop=True)
    )

    # 1) Base trading signal: always decide (UP/DOWN).
    test_df['trade_signal'] = test_df['pred_label'].astype(float)

    base_cfg = TradingConfig(
        fee_rate=FEE_RATE,
        stop_loss_pct=0.0,
        exit_on_no_signal=False,
        close_final_position=True,
        initial_cash=1.0,
    )

    base_run = backtest_long_flat_signals(test_df, signal_col='trade_signal', cfg=base_cfg)
    print('Base trading metrics:', base_run['metrics'])
    plot_backtest_price_and_equity(base_run, title=f'{model_name.upper()} - Trading (always decide)', vis_cfg=VIS_CFG)

    # 2) Optional neutral-zone signal: model can say 'I am not sure'.
    # This can reduce overtrading when probabilities are near 0.50.
    test_df['trade_signal_neutral'] = build_signal_from_proba(
        test_df['pred_proba_up'].to_numpy(dtype=float),
        threshold_up=UP_TH,
        threshold_down=DOWN_TH,
    )

    # Quick sweep: try a few stop-loss values and the 'exit on no-signal' switch.
    sweep = sweep_trading_params(
        df=test_df,
        signal_col='trade_signal_neutral',
        stop_losses=STOP_LOSSES,
        exit_on_no_signal_options=(False, True),
        fee_rate=FEE_RATE,
        close_final_position=True,
        initial_cash=1.0,
        verbose=True,
    )

    display(sweep['results_table'].head(10))
    if sweep['best_run'] is not None:
        plot_backtest_price_and_equity(
            sweep['best_run'],
            title=f'{model_name.upper()} - Trading (best by ulcer-adjusted return)',
            vis_cfg=VIS_CFG,
        )
