In [2]:
import pandas as pd
import os

notebook_dir = os.getcwd()
data_dir = os.path.join(notebook_dir, "data")

# Load data
X_train = pd.read_csv(f"{data_dir}/X_train.csv", index_col=0)
y_train = pd.read_csv(f"{data_dir}/y_train.csv", index_col=0).squeeze()
X_test = pd.read_csv(f"{data_dir}/X_test.csv", index_col=0)


Given the characteristics of the dataset:

- **Moderate feature count**
- **Moderate sample size**
- **Structured, tabular data (no time series)**

A **tree-based model** is an appropriate choice.

I selected **Histogram-based Gradient Boosting (HGB)** because **LightGBM** and **XGBoost** could not be installed on my macOS system due to missing system libraries and a 32-bit Python environment. HGB ships with scikit-learn, so no additional installation was needed.


In [3]:
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

RANDOM_STATE = 42

# Impute missing values with the column median (instead of filling with 0, as in the base model)
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)


Next, I performed **feature selection** to improve the model's generalization.  
The goals were to:

- Reduce noise in the dataset  
- Potentially improve model performance  
- Increase interpretability by focusing on the most important features

In [4]:
model = HistGradientBoostingClassifier(random_state=RANDOM_STATE)
model.fit(X_train_imp, y_train)

r = permutation_importance(model, X_train_imp, y_train, n_repeats=5, random_state=RANDOM_STATE)
importances = pd.Series(r.importances_mean, index=X_train.columns).sort_values(ascending=False)

print("Top 10 important features:\n", importances.head(10)) # just to see


Top 10 important features:
 US_3M_yield             0.042066
US_Credit_Spread        0.041736
US_bonds_implied_vol    0.035207
JPN_Momentum_100        0.032314
EU_3M_yield             0.022025
EU_stock_implied_vol    0.020868
WORLD_returns           0.018636
WORLD_Momentum_20       0.018140
Gold_Momentum_20        0.017438
US_PE                   0.016446
dtype: float64


From the feature importance ranking, we can see that **US rates and credit variables** (short-term yield, credit spread, and bond implied volatility) have the highest impact on the global bear/bull indicator. This is intuitive, as they often act as key risk-on/risk-off signals for global liquidity.

**US_3M_yield**  
- Serves as a proxy for the monetary policy stance and global liquidity conditions.  
- A sharp rise typically reflects tightening liquidity and tends to correspond with risk-off behavior.

**US_Credit_Spread**  
- A widening credit spread signals increasing credit risk and heightened investor risk aversion.  
- This makes it a strong indicator of bearish market conditions.

**US_bonds_implied_vol**  
- Rising implied volatility reflects uncertainty around macroeconomic conditions or policy direction.  
- Such uncertainty often triggers defensive positioning in global markets.

**JPN_Momentum_100**  
- Its high importance also makes sense, as it acts as a proxy for the global **carry trade**.  
- Strong Japanese equity momentum → expansion of carry trades → risk-on sentiment → bullish market conditions.

In [5]:
threshold = importances.median()
selected_features = importances[importances >= threshold].index.tolist()
print(f"Selected {len(selected_features)} of {len(importances)} features")

# Create the filtered datasets for tuning
X_train_sel = X_train[selected_features]
X_test_sel = X_test[selected_features]


Selected 17 of 33 features


I kept the **top 50% most predictive variables**, meaning every feature whose permutation importance is at or above the median (17 of 33), and discarded the rest, as the lower-ranked features were likely adding noise or redundancy.  
Since **HistGradientBoosting** naturally downweights irrelevant or redundant variables, more aggressive feature filtering is not strictly necessary. A simple median cutoff keeps the process easy to follow while still aiming to improve generalization.
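
To make the median cutoff concrete, here is a small sketch that applies the same `>= median` rule to the ten importances printed above. This is only the top-10 subset, used for illustration; the notebook applies the rule to all 33 features. The helper `kept_with_median_cutoff` is a hypothetical name and assumes all importance values are distinct.

```python
from statistics import median

# Top-10 permutation importances copied from the printed output above.
top10 = {
    "US_3M_yield": 0.042066,
    "US_Credit_Spread": 0.041736,
    "US_bonds_implied_vol": 0.035207,
    "JPN_Momentum_100": 0.032314,
    "EU_3M_yield": 0.022025,
    "EU_stock_implied_vol": 0.020868,
    "WORLD_returns": 0.018636,
    "WORLD_Momentum_20": 0.018140,
    "Gold_Momentum_20": 0.017438,
    "US_PE": 0.016446,
}

# Same rule as in the notebook: keep features at or above the median importance.
threshold = median(top10.values())
selected = [name for name, value in top10.items() if value >= threshold]
print(f"Kept {len(selected)} of {len(top10)} features: {selected}")


def kept_with_median_cutoff(n_features: int) -> int:
    """Number of features kept by a '>= median' rule, assuming no tied values."""
    return (n_features + 1) // 2


print(kept_with_median_cutoff(33))
```

With an odd number of features the median is itself one of the importances, so the rule keeps the middle feature as well, which is why 17 rather than 16 of the 33 features survive.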


Next, I applied **Optuna** for hyperparameter tuning to identify the best model configuration using the reduced feature set.

To define reasonable search ranges, I used the **midpoint values** from the official scikit-learn documentation for `HistGradientBoostingClassifier` as reference, and created parameter intervals around those values for Optuna to explore. This ensures the search space is both informed and efficient, increasing the likelihood of finding strong configurations.
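
Some of the intervals (`learning_rate`, `l2_regularization`) are searched on a logarithmic scale, so small values get as much attention as large ones. The sketch below is a hand-rolled illustration of log-uniform sampling over the `learning_rate` range 0.01 to 0.08. It is not Optuna's internal implementation, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
draws = [sample_log_uniform(0.01, 0.08, rng) for _ in range(10_000)]

# On a log scale the "middle" of the range is the geometric mean of its ends.
geometric_mid = math.sqrt(0.01 * 0.08)
share_below_mid = sum(d < geometric_mid for d in draws) / len(draws)

print(f"geometric midpoint: {geometric_mid:.4f}")
print(f"share of draws below it: {share_below_mid:.2f}")
```

About half of the draws fall below the geometric midpoint (roughly 0.028), whereas uniform sampling on the raw scale would place only about a quarter of them there.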

In [6]:
import optuna
from optuna.pruners import MedianPruner
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
import numpy as np

RANDOM_STATE = 42

# Stronger-regularization search space (as it was highly overfitting)
def objective_penalized(trial):
    params = {
        # smaller learning rate range (safer)
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.08, log=True),
        "max_iter": trial.suggest_int("max_iter", 200, 800),
        # smaller trees and leaf nodes
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "max_leaf_nodes": trial.suggest_int("max_leaf_nodes", 7, 31),
        # stronger min samples per leaf
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 50, 400),
        # stronger l2 regularization
        "l2_regularization": trial.suggest_float("l2_regularization", 1e-1, 50.0, log=True),
        "max_bins": trial.suggest_int("max_bins", 64, 128),
        # keep early stopping on but shorter patience
        "early_stopping": True,
        "n_iter_no_change": 8,
        "validation_fraction": 0.12,
        "random_state": RANDOM_STATE,
    }

    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("hgb", HistGradientBoostingClassifier(**params)),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    cv_scores = cross_val_score(pipe, X_train_sel, y_train, cv=cv, scoring="accuracy", n_jobs=-1)

    cv_mean = float(np.mean(cv_scores))

    # Refit on full training set to compute train accuracy for gap
    pipe.fit(X_train_sel, y_train)
    train_pred = pipe.predict(X_train_sel)
    train_acc = float(accuracy_score(y_train, train_pred))

    gap = train_acc - cv_mean

    # Penalty weight on the train-CV gap; increase to punish overfitting more strongly.
    # For reference, alpha = 0.3 gave Train Accuracy: 0.9826,
    # Cross-Validated Accuracy: 0.8696, Generalization Gap: 0.1130.
    alpha = 0.6
    penalized_score = cv_mean - alpha * gap

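    # Note: should_prune() only acts on intermediate values passed to trial.report().
    # None are reported in this objective, so no trial is actually pruned here.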
    if trial.should_prune():
        raise optuna.TrialPruned()

    return penalized_score

# Run study with pruning
pruner = MedianPruner(n_startup_trials=10, n_warmup_steps=0, interval_steps=1)
study = optuna.create_study(direction="maximize", pruner=pruner, study_name="HGB_penalized")
study.optimize(objective_penalized, n_trials=80, n_jobs=1)

print("Best trial (penalized):", study.best_trial.number)
print("Best penalized score:", study.best_value)
best_params = study.best_trial.params
print("Best params:", best_params)



[I 2025-11-06 13:26:14,957] A new study created in memory with name: HGB_penalized
[I 2025-11-06 13:26:18,093] Trial 0 finished with value: 0.7747933884297521 and parameters: {'learning_rate': 0.02185474176336963, 'max_iter': 649, 'max_depth': 7, 'max_leaf_nodes': 7, 'min_samples_leaf': 222, 'l2_regularization': 0.2869727819421245, 'max_bins': 64}. Best is trial 0 with value: 0.7747933884297521.
[I 2025-11-06 13:26:19,642] Trial 1 finished with value: 0.7772727272727272 and parameters: {'learning_rate': 0.041372871842566265, 'max_iter': 601, 'max_depth': 3, 'max_leaf_nodes': 21, 'min_samples_leaf': 94, 'l2_regularization': 3.327020853623081, 'max_bins': 71}. Best is trial 1 with value: 0.7772727272727272.
[I 2025-11-06 13:26:21,983] Trial 2 finished with value: 0.784793388429752 and parameters: {'learning_rate': 0.03620791254615446, 'max_iter': 565, 'max_depth': 7, 'max_leaf_nodes': 22, 'min_samples_leaf': 228, 'l2_regularization': 0.62

Best trial (penalized): 30
Best penalized score: 0.8025206611570248
Best params: {'learning_rate': 0.019897565005014772, 'max_iter': 742, 'max_depth': 5, 'max_leaf_nodes': 29, 'min_samples_leaf': 145, 'l2_regularization': 3.76492541993063, 'max_bins': 114}


The search space was intentionally conservative, because earlier experiments showed clear signs of overfitting (training accuracy close to 1.0).

I also implemented a **penalized objective**, where the cross-validation accuracy was adjusted by subtracting a penalty proportional to the train–CV accuracy gap.
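
As a worked example of this penalty, the sketch below evaluates the formula `cv_mean - alpha * (train_acc - cv_mean)` for the accuracies quoted in the code comment for the α = 0.3 run (train 0.9826, CV 0.8696). The standalone function `penalized_score` is a hypothetical helper that mirrors the expression inside `objective_penalized`.

```python
def penalized_score(cv_mean: float, train_acc: float, alpha: float) -> float:
    """Cross-validated accuracy minus alpha times the train-CV gap."""
    gap = train_acc - cv_mean
    return cv_mean - alpha * gap


cv_mean, train_acc = 0.8696, 0.9826  # accuracies quoted for the alpha = 0.3 run

for alpha in (0.0, 0.3, 0.6):
    score = penalized_score(cv_mean, train_acc, alpha)
    print(f"alpha={alpha:.1f} -> penalized score {score:.4f}")

# Equivalent form: the score rewards CV accuracy and penalizes train accuracy.
alpha = 0.3
equivalent = (1 + alpha) * cv_mean - alpha * train_acc
```

Rewriting the score as `(1 + alpha) * cv_mean - alpha * train_acc` shows that, for a fixed CV accuracy, a higher training accuracy always lowers the objective, which pushes the search towards more strongly regularized configurations.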

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, brier_score_loss, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score


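# Note: best_params holds only the values Optuna tuned. The fixed settings used inside
# the objective (early_stopping, n_iter_no_change, validation_fraction, random_state)
# are not passed here, so this final model uses scikit-learn's defaults for them.
final_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("hgb", HistGradientBoostingClassifier(**best_params))
])

final_pipe.fit(X_train_sel, y_train)
y_train_pred = final_pipe.predict(X_train_sel)
y_train_proba = final_pipe.predict_proba(X_train_sel)[:, 1]
train_acc = accuracy_score(y_train, y_train_pred)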

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_acc = cross_val_score(final_pipe, X_train_sel, y_train, cv=cv, scoring="accuracy", n_jobs=-1).mean()
y_pred_cv = cross_val_predict(final_pipe, X_train_sel, y_train, cv=cv, method="predict")
y_proba_cv = cross_val_predict(final_pipe, X_train_sel, y_train, cv=cv, method="predict_proba")[:, 1]


print(f"Train Accuracy: {train_acc:.4f}")
print(f"Cross-Validated Accuracy: {cv_acc:.4f}")
print(f"Generalization Gap: {train_acc - cv_acc:.4f}")
print("\nClassification Report (CV):")
print(classification_report(y_train, y_pred_cv))
print(f"\nBrier Score (CV): {brier_score_loss(y_train, y_proba_cv):.4f}")

y_pred_test = final_pipe.predict(X_test_sel)
pred_series = pd.Series(y_pred_test, index=X_test.index, name="Target")
pred_series.to_csv("y_pred.csv")

Train Accuracy: 0.8959
Cross-Validated Accuracy: 0.8262
Generalization Gap: 0.0696

Classification Report (CV):
              precision    recall  f1-score   support

         0.0       0.78      0.68      0.73      1647
         1.0       0.85      0.90      0.87      3193

    accuracy                           0.83      4840
   macro avg       0.81      0.79      0.80      4840
weighted avg       0.82      0.83      0.82      4840


Brier Score (CV): 0.1243


The generalization gap of ~0.07 indicates **mild overfitting**, but it is much smaller than in earlier experiments, showing that the stronger regularization and feature filtering were effective in stabilizing the model.

The model shows a clear **bull market bias**. It identifies bull markets (class 1) more accurately than bear markets (class 0). This is intuitive because bull markets tend to be more frequent and longer in duration.

The recall for bear markets (0.68) is noticeably lower than for bull markets (0.90), meaning the model misses roughly a third of bear-market observations. This is consistent with the underlying data distribution, where bear markets are less common and often noisier or shorter-lived.
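
The class imbalance and the cost of the lower bear-market recall can be read off the classification report with a little arithmetic. The sketch below uses only the support and recall values printed above. Because the report rounds recall to two decimals, the resulting counts are approximate.

```python
# Values copied from the cross-validated classification report above.
support = {"bear (0.0)": 1647, "bull (1.0)": 3193}
recall = {"bear (0.0)": 0.68, "bull (1.0)": 0.90}

total = sum(support.values())

# Share of the majority class: the accuracy of always predicting "bull".
majority_share = support["bull (1.0)"] / total

# Approximate number of bear-market observations the model fails to flag.
missed_bears = round(support["bear (0.0)"] * (1 - recall["bear (0.0)"]))

# Support-weighted recall is the overall accuracy (up to rounding in the report).
weighted_recall = sum(support[k] * recall[k] for k in support) / total

print(f"total observations:        {total}")
print(f"majority-class share:      {majority_share:.4f}")
print(f"approx. missed bear cases: {missed_bears}")
print(f"support-weighted recall:   {weighted_recall:.4f}")
```

The majority-class share of about 0.66 is a useful reference point: any accuracy figure for this dataset should be read against what a constant "bull" prediction would already achieve.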

When α (the penalty on overfitting) was lowered to 0.3, the model achieved a train accuracy of 0.9826 and a cross-validated accuracy of 0.8696, with a wider generalization gap of 0.1130. Overall accuracy and bull-market performance remained stable, but the model fits the training data more tightly and shows a better balance across classes. For asset-allocation applications, where missing a bear market can be more costly than a false bear signal, this trade-off seems acceptable.

Overall, the model achieves a solid balance between predictive power and stability, especially given the noisy and imbalanced nature of financial-market regimes. Compared to the base case with a CV accuracy of 0.66, the CV accuracy of 0.8262 is a gain of about 17 percentage points, or roughly 25% in relative terms.
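
The improvement over the base case can be stated in percentage points or in relative terms, and the two differ noticeably. This is a small arithmetic sketch using only the accuracies reported in this notebook (0.66 for the base case, 0.8262 for α = 0.6, 0.8696 for α = 0.3); the function name `improvement` is hypothetical.

```python
def improvement(base: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (new - base) * 100
    relative_pct = (new / base - 1) * 100
    return absolute_pp, relative_pct


base_cv = 0.66
runs = {"alpha = 0.6": 0.8262, "alpha = 0.3": 0.8696}

for label, cv_acc in runs.items():
    abs_pp, rel_pct = improvement(base_cv, cv_acc)
    print(f"{label}: +{abs_pp:.1f} percentage points, +{rel_pct:.1f}% relative")
```

The relative gain is about 25% for the α = 0.6 model and about 32% for the α = 0.3 variant, so it is worth stating which run and which measure a quoted improvement refers to.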

Future work could try additional gradient-boosting implementations (CatBoost, LightGBM, XGBoost) or more sophisticated feature selection.



