# Stock Return Prediction Challenge by QRT

    I) Data Exploration and Understanding:
        1)  Load and inspect data
        2)  Visualize distributions
        3)  Correlation Analysis

    II) Handle Missing Values
        -   Use interpolation methods that make sense for time series data.
        -   Forward-fill or backward-fill missing values if temporal order can be assumed.
        -   If there is no temporal continuity, fall back on statistics such as the mean, median or conditional averages.

    III) Noise and Outlier Handling
        -   Robust statistics like the median

    IV) Feature Engineering
        1)  Feature Selection: Start with a small feature set and add features step by step to see
            how the predictive power changes. This doubles as a kind of ablation study.
        2)  Lagged Features: Compute rolling statistics (mean, std) for RET_1 to RET_20 and Volumes 
            over various windows
        3)  Interaction Terms: Create features that combine stock-specific and sector-specific trends.
        4)  If dates are randomized, no temporal patterns can be extracted. In that case, focus on
            relationships between features instead.
        5)  Consider log or Fourier transforms as features:
            -   Log Transform: Stabilizes variance and handles skewed data.
                Works on individual features. Can reduce the impact of outliers.
            -   Fourier Features: Capture periodic / seasonal patterns.
                Require sequential data (time series). Can filter noise by keeping only low frequencies.
        6) Feature Selection through Feature Importance

    V) Models and Benchmark:
        1)  Start simple with the baseline Random Forest implementation and
            cross-validation to ensure proper assessment and avoid overfitting.
        2)  Try out state-of-the-art models such as CatBoost or AutoGluon (by Amazon).
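
The rolling statistics and the log transform from the outline can be sketched as follows. This is a minimal illustration, not the code in `features/feature_engineering.py`: the helper names `add_window_stats` and `signed_log1p` are hypothetical, and the sketch assumes that `RET_1` is the most recent day, so a window of size `w` covers `RET_1` to `RET_w`. A signed variant of the log transform is used because returns (and possibly normalized volumes) can be negative.

```python
import numpy as np
import pandas as pd


def add_window_stats(df, prefix="RET", windows=(5, 20)):
    """Append mean/std over the first `w` daily columns for each window size.

    Assumption: column `<prefix>_1` is the most recent day.
    """
    out = df.copy()
    for w in windows:
        cols = [f"{prefix}_{day}" for day in range(1, w + 1)]
        out[f"{prefix}_MEAN_{w}"] = df[cols].mean(axis=1)
        out[f"{prefix}_STD_{w}"] = df[cols].std(axis=1)
    return out


def signed_log1p(series):
    """Log-compress magnitudes while keeping the sign (safe for negative values)."""
    return np.sign(series) * np.log1p(np.abs(series))


# Toy frame that mimics the RET_1 ... RET_20 column naming
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(0, 0.02, size=(4, 20)),
    columns=[f"RET_{day}" for day in range(1, 21)],
)
toy_stats = add_window_stats(toy)
print(toy_stats[["RET_MEAN_5", "RET_STD_5", "RET_MEAN_20", "RET_STD_20"]])
```

The same helper can be called with `prefix="VOLUME"` to obtain the volume statistics.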

In [1]:
# Load Necessary Libraries
import os
import copy
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from autogluon.tabular import TabularPredictor

from model.ml_model import ModelTrainer
from features.feature_engineering import generate_features

### Load the Data

In [2]:
DATASET_PATH = 'data'

# Load the relevant datasets
x_train = pd.read_csv(os.path.join(DATASET_PATH, 'x_train.csv'))
y_train = pd.read_csv(os.path.join(DATASET_PATH, 'y_train.csv'))

test_df = pd.read_csv(os.path.join(DATASET_PATH, 'x_test.csv'))
train_df = pd.concat((x_train, y_train), axis=1)

### Training Data with NaN Rows

In [3]:
return_features = [f'RET_{day}' for day in range(1, 21)]

# Rows in which every one of the 20 daily returns is NaN cannot be
# interpolated or imputed in a meaningful way, so they are dropped.
rows_to_drop = train_df[train_df[return_features].isna().all(axis=1)]
train_df.drop(index=rows_to_drop.index, inplace=True)

### Check for Imbalances
The distribution of the dataset was analyzed to determine whether most of the data follows a specific statistical pattern. Such an examination also indicates whether introducing new features would be beneficial: if a clear skew is detected, transformations such as a log transform can help improve the model's performance. The `features` folder, in particular `statistical_features.py`, offers a more in-depth analysis.

In [None]:
print(train_df['RET'].value_counts(normalize=True)*100)
# The result shows that the two classes are balanced.
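
The skew check mentioned above can be sketched with pandas' built-in sample skewness. This is only an illustration on synthetic data; the helper name `skew_report` and the threshold of 1.0 are assumptions (a common rule of thumb), not values taken from `statistical_features.py`.

```python
import numpy as np
import pandas as pd


def skew_report(df, columns, threshold=1.0):
    """Return the sample skewness per column and flag strongly skewed ones."""
    skew = df[columns].skew()
    return pd.DataFrame({"skew": skew, "strongly_skewed": skew.abs() > threshold})


# Synthetic example: one symmetric and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.lognormal(size=5000),
})
report = skew_report(toy, ["symmetric", "right_skewed"])
print(report)
```

On the real data, the same call could be applied to the `RET_*` and `VOLUME_*` columns; flagged columns are candidates for a log-type transform.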

### Handling of further NaN Values

In [5]:
for day in range(1, 21):
    # Replace missing VOLUME features
    train_df[f'VOLUME_{day}'] = train_df[f'VOLUME_{day}'].fillna(train_df[f'VOLUME_{day}'].median())
    test_df[f'VOLUME_{day}'] = test_df[f'VOLUME_{day}'].fillna(test_df[f'VOLUME_{day}'].median())
    
    # Replace missing RET features
    train_df[f'RET_{day}'] = train_df[f'RET_{day}'].fillna(train_df[f'RET_{day}'].median())
    test_df[f'RET_{day}'] = test_df[f'RET_{day}'].fillna(test_df[f'RET_{day}'].median())
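
An alternative to the per-set medians above is the "conditional average" idea from the outline: fill each gap with the median of its own group, using statistics from the training set only so that train and test are imputed consistently. The sketch below is hedged: `impute_with_train_medians` is a hypothetical helper, and grouping by `SECTOR` is just one plausible choice.

```python
import numpy as np
import pandas as pd


def impute_with_train_medians(train, test, columns, group_col=None):
    """Fill NaNs in both frames using medians computed on `train` only."""
    train, test = train.copy(), test.copy()
    for col in columns:
        global_median = train[col].median()
        if group_col is not None:
            # Median per group, e.g. per sector
            group_median = train.groupby(group_col)[col].median()
            train[col] = train[col].fillna(train[group_col].map(group_median))
            test[col] = test[col].fillna(test[group_col].map(group_median))
        # Fallback for groups that are unseen or entirely NaN in train
        train[col] = train[col].fillna(global_median)
        test[col] = test[col].fillna(global_median)
    return train, test


toy_train = pd.DataFrame({"SECTOR": [0, 0, 1, 1],
                          "RET_1": [0.01, np.nan, 0.03, 0.05]})
toy_test = pd.DataFrame({"SECTOR": [1, 2], "RET_1": [np.nan, np.nan]})

filled_train, filled_test = impute_with_train_medians(
    toy_train, toy_test, ["RET_1"], group_col="SECTOR"
)
print(filled_test)
```

In the toy example, the test row from sector 1 receives that sector's training median, while the row from the unseen sector 2 falls back to the global training median.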

### Compute Domain-Specific Features
These features can be found in the `features` folder, where they are grouped into return-specific and volume-specific features. They are commonly known signals used in portfolio management.

#### Return-Related Features:

    1. Autocorrelation of Returns (ACF)
    2. Rolling Sharpe Ratio
    3. Relative Strength Index (RSI)
    4. Rate of Change (ROC)
    5. Momentum
    6. Stochastic Oscillator
    7. Moving Average Convergence Divergence (MACD)
    8. Golden Cross
    9. Bollinger Bands
    10. Cumulative Return
    11. Money Flow Index (MFI)
    12. Average True Range (ATR)
    13. Lagged Returns (Multiple Lags)


#### Volume-Related Features:

    1. Volume Price Trend (VPT)
    2. On-Balance Volume (OBV)
    3. Chaikin Money Flow (CMF)
    4. Volume Weighted Average Price (VWAP)
    5. Volume Oscillator
    6. Moving Average Convergence Divergence (MACD), applied to volume
    7. Volume Change Ratio (VCR)
    8. Price-Volume-Trend (PVT)
    9. Average Volume (for the last 'n' days)
    10. Volume Deviation
    11. Volume Spike Detection
    12. Accumulation/Distribution Line (ADL)
    13. Relative Volume

#### Cross / Statistical Features:

    1. Return x Volume Interaction
    2. Lagged Return x Volume Interaction
    3. Cumulative Return - Volume Interaction
    4. Sector or Industry-Specific Trends

This rather extensive list covers the essential return and volume features.
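
To make a few of the listed signals concrete, here is a small sketch that computes a cumulative return, a Sharpe-like ratio, a simple-average RSI and a return x volume interaction per row. It is not the implementation behind `generate_features`; the function `sketch_features`, the output column names and the window of 5 days are assumptions for illustration, as is the convention that `RET_1` / `VOLUME_1` are the most recent day.

```python
import numpy as np
import pandas as pd


def sketch_features(df, window=5):
    """Compute a handful of illustrative per-row features over `window` days."""
    ret_cols = [f"RET_{d}" for d in range(1, window + 1)]
    rets = df[ret_cols]
    out = pd.DataFrame(index=df.index)

    # Cumulative return: compound the daily returns over the window
    out[f"CUM_RET_{window}"] = (1.0 + rets).prod(axis=1) - 1.0

    # Sharpe-like ratio: mean return divided by its standard deviation
    std = rets.std(axis=1).replace(0.0, np.nan)
    out[f"SHARPE_{window}"] = rets.mean(axis=1) / std

    # RSI with simple averages: share of gains in the total absolute movement
    gains = rets.clip(lower=0).sum(axis=1)
    losses = (-rets.clip(upper=0)).sum(axis=1)
    total = (gains + losses).replace(0.0, np.nan)
    out[f"RSI_{window}"] = 100.0 * gains / total

    # Return x volume interaction for the most recent day
    out["RET_1_x_VOLUME_1"] = df["RET_1"] * df["VOLUME_1"]
    return out


toy = pd.DataFrame({
    "RET_1": [0.01], "RET_2": [-0.02], "RET_3": [0.03],
    "RET_4": [0.0], "RET_5": [0.01], "VOLUME_1": [1.0],
})
feats = sketch_features(toy)
print(feats)
```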

In [None]:
train_df, test_df, new_features = generate_features(train_df, test_df)

# Further preparation to start training
target = 'RET'

# Identifier and categorical columns are excluded from the model inputs
remove_features = ['DATE', 'ID', 'STOCK', 'INDUSTRY', 'INDUSTRY_GROUP', 'SECTOR', 'SUB_INDUSTRY']
original_features = [col for col in x_train.columns if col not in remove_features]

features = new_features + original_features
print(f'Number of Features used for Prediction and Training: {len(features)}')

### Setup for Training

Here you can choose which model to use and whether to run cross-validation to analyze its performance or to train on the full training set and write a submission.

In [7]:

X_train = train_df[features]
Y_train = train_df[target]

X_test = test_df[features]

use_model = 'CatBoost'
cross_validation = True

### Model Selection and Training

I tried out the following models:
1. CatBoost
2. RandomForest
3. AutoGluon
4. TabPFN
5. XGBoost
6. Ensembles

TabPFN is a state-of-the-art tabular foundation model that was presented only recently, in 2025. Its limitation is that it can only handle small tabular datasets with fewer than 10k rows. In the [TabPFN](https://www.nature.com/articles/s41586-024-08328-6) paper, the authors mention that for large datasets with roughly 100 features the best-performing model was always AutoGluon. In my case AutoGluon gave poor results, most likely because of a poor choice of hyperparameters. The model also requires a lot of compute, so I ran it on the cluster. With a better hyperparameter strategy and more training time, I am confident that better results could have been achieved.

In the end I used the predictions from CatBoost, although here too better hyperparameter tuning to avoid overfitting could have improved the results. Even though AutoGluon should in theory already build ensembles, I also tried combining CatBoost, RandomForest and XGBoost as a manual ensemble with equal weighting.

$$
\mathrm{Ensemble}(X) = \lambda_1 \cdot \mathrm{CatBoost}(X) + \lambda_2 \cdot \mathrm{RandomForest}(X) + \lambda_3 \cdot \mathrm{XGBoost}(X)
$$

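Here $X$ is the input and each term is the prediction of the respective model. The weights $\lambda_i$ combine these predictions into the ensemble output; with equal weighting, and assuming the weights sum to one, $\lambda_1 = \lambda_2 = \lambda_3 = 1/3$.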
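
The weighted combination can be sketched in a few lines of NumPy. The sketch assumes that each model contributes its predicted probability for the positive class (for example the second column of `predict_proba`) and that the weights sum to one; whether the notebook blended probabilities or hard labels is not stated, so treat this as one possible reading. The function name `weighted_ensemble` and the probability values are made up for illustration.

```python
import numpy as np


def weighted_ensemble(probas, weights=None):
    """Blend positive-class probabilities from several models.

    probas  : sequence of arrays, one per model, each of shape (n_samples,)
    weights : optional weights summing to 1; defaults to equal weighting
    """
    probas = np.asarray(probas, dtype=float)  # shape: (n_models, n_samples)
    if weights is None:
        weights = np.full(len(probas), 1.0 / len(probas))
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    return weights @ probas


# Hypothetical probabilities from three fitted models on three samples
p_cat = np.array([0.9, 0.4, 0.2])
p_rf = np.array([0.6, 0.5, 0.3])
p_xgb = np.array([0.6, 0.3, 0.4])

blended = weighted_ensemble([p_cat, p_rf, p_xgb])
labels = blended > 0.5  # threshold the blended probability
print(blended, labels)
```

Unequal weights, for instance chosen from cross-validation scores, can be passed via `weights`.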

In [None]:

if use_model == 'RandomForest':
    rf_params = {
        'n_estimators': 500,
        'max_depth': 8,
        'random_state': 0,
        'n_jobs': -1
    }
    model = RandomForestClassifier(**rf_params)

elif use_model == 'CatBoost':
    cat_params = {
        'iterations': 2000,
        'depth': 8,
        'learning_rate': 0.05,
        'loss_function': 'Logloss',
        'verbose': False
    }

    model = CatBoostClassifier(**cat_params)

elif use_model == 'XGBoost':
    xg_params = {
        'n_estimators': 10, 
        'max_depth': 8,          
        'learning_rate': 0.05,  
        'subsample': 0.8,      
        'colsample_bytree': 0.8,   
        'objective': 'binary:logistic', 
        'eval_metric': 'logloss',   
        'random_state': 0,
        'n_jobs': -1,       
        'tree_method': 'hist'  
    }

    model = XGBClassifier(**xg_params)

elif use_model == 'AutoGluon':
    # AutoGluon only needs the name of the label column at construction time
    ag_params = {'label': target}
    model = TabularPredictor(**ag_params)

else:
    raise ValueError('Not a valid model!')

### Training / Cross Validation

In [None]:
ml_model = ModelTrainer(model=model, use_model=use_model)

if cross_validation:
    if use_model == 'AutoGluon':
        ml_model.validate_model(
            X=X_train, 
            Y=Y_train, 
            time_limit=3600, 
            num_gpus=1, 
            num_cpus=1
        )
    else:
        ml_model.cross_validate(X=X_train, Y=Y_train)

else:
    if use_model == 'AutoGluon':
        # AutoGluon expects features and label in a single DataFrame
        train_data = pd.concat((X_train, Y_train), axis=1)
        ml_model.model.fit(
            train_data=train_data,
            time_limit=100,
            verbosity=3,
            num_gpus=1,
            num_cpus=1
        )
    else:
        ml_model.model.fit(X_train, Y_train)

    prediction = ml_model.predict(X_test)
    ml_model.save_submission(pred=prediction, test_set=test_df)
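
The internals of `ModelTrainer.cross_validate` are not shown here. One way such a validation could be set up, sketched below as an assumption rather than a description of that method, is to split by `DATE` so that all rows of a given date fall into the same fold; this avoids validating on rows from dates the model has already seen. The generator `date_kfold` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def date_kfold(dates, n_splits=4, seed=0):
    """Yield (train_idx, valid_idx) with each date in exactly one validation fold."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_dates)  # shuffle the dates, not the individual rows
    for fold_dates in np.array_split(unique_dates, n_splits):
        valid_mask = np.isin(dates, fold_dates)
        yield np.flatnonzero(~valid_mask), np.flatnonzero(valid_mask)


# Toy example: 8 rows spread over 4 dates
toy_dates = pd.Series([0, 0, 1, 1, 2, 2, 3, 3])
folds = list(date_kfold(toy_dates, n_splits=4))
for train_idx, valid_idx in folds:
    print("train rows:", train_idx, "validation rows:", valid_idx)
```

With the real data, the index arrays could be used as `X_train.iloc[train_idx]` and `X_train.iloc[valid_idx]` to fit and score any of the models above per fold.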