### Optuna Optimization

- The function optimizer(trial) tunes a model's hyperparameters with the Optuna library in Python. Its objective is the mean absolute error (MAE) of the model's backtest predictions for a given set of hyperparameters, which the study minimizes.

- Optuna calls the function once per trial and passes in a trial object. The function uses that object to ask Optuna's sampling algorithm for a value of each hyperparameter within a specified range.

- Depending on the model method provided (i.e., "xgboost", "Lasso", "ElasticNet", or "Ridge"), the function fixes some hyperparameters and samples the ones to be optimized with calls such as trial.suggest_int() or trial.suggest_float(). For Lasso and Ridge it samples alpha; for ElasticNet it samples both alpha and l1_ratio. For xgboost it samples min_child_weight, learning_rate, alpha, and max_depth, while n_estimators is fixed at 100.

- The function then creates an instance of the Backtester class and sets its parameters, such as the training data, the hyperparameters, and the features to preprocess and to model on. The Backtester runs a backtest of the model on the training data and produces predicted values for the target variable.

- The function retrieves the predicted values and computes the mean absolute error against the realized values using mean_absolute_error() from the scikit-learn library. It returns this mean absolute error as the trial's score.

In [40]:
%%capture
!pip install optuna
!pip install GPUtil
!pip install statsmodels
!pip install linearmodels
!pip install xgboost
!pip install quantstats
!pip install joblib
!pip install seaborn

In [40]:
%%capture
import sklearn
import pandas as pd 
from sklearn.metrics import mean_absolute_error 
import xgboost as xgb
import optuna 
import numpy as np


# Statistical Analysis Libraries 
from statsmodels.regression.rolling import RollingOLS
from sklearn.linear_model import Lasso, ElasticNet, Ridge
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import MaxAbsScaler, PowerTransformer, MinMaxScaler, QuantileTransformer, RobustScaler
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import set_config
from sklearn.base import BaseEstimator, TransformerMixin, clone # How to create our own scaler 
import statsmodels.api as sm
import seaborn as sns 

import multiprocessing
from tqdm import tqdm 


# To automatically load changes in different files 
%load_ext autoreload
%autoreload 2

---
### Importing data

In [41]:
training_df = pd.read_csv('training_df_extra_features.csv', index_col = [0])
# Add duplicate beta factor columns for constraint purposes
training_df['cons_beta_mktrf'] = training_df['beta_mktrf']
training_df['cons_beta_smb'] = training_df['beta_smb']
training_df['cons_beta_hml'] = training_df['beta_hml']
training_df['cons_beta_mom'] = training_df['beta_mom']

# Adding portfolio betas for benchmark
training_df['vw'] = training_df['mktcap']/training_df.groupby(['date'])['mktcap'].transform('sum')
beta_list = ['beta_mktrf', 'beta_smb', 'beta_hml', 'beta_mom']
for beta in beta_list:
    # Value-weighted benchmark beta per date: sum over stocks of weight * beta.
    # Multiplying the columns directly keeps the result aligned with training_df's index.
    weighted_beta = training_df[beta] * training_df['vw']
    training_df[f'{beta}_bench'] = weighted_beta.groupby(training_df['date']).transform('sum')
    
training_df['date'] = pd.to_datetime(training_df['date'])
training_df = training_df.dropna()
training_df = training_df.rename(columns = {'av_atmcall': 'iv_atmcall',
                                            'av_otmput': 'iv_otmput',
                                            'av_otmcall': 'iv_otmcall'})
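
The benchmark betas are value-weighted averages within each date. A small worked example makes the arithmetic easy to check by hand; the four-row `toy` frame below is invented for illustration, and only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): two stocks on each of two dates.
toy = pd.DataFrame({
    "date": ["2015-01-02", "2015-01-02", "2015-01-05", "2015-01-05"],
    "mktcap": [1.0, 3.0, 2.0, 2.0],
    "beta_mktrf": [1.0, 2.0, 0.5, 1.5],
})

# Value weight: each stock's share of total market cap on its date.
toy["vw"] = toy["mktcap"] / toy.groupby("date")["mktcap"].transform("sum")

# Benchmark beta: sum of weight * beta within each date, broadcast to every row.
weighted = toy["beta_mktrf"] * toy["vw"]
toy["beta_mktrf_bench"] = weighted.groupby(toy["date"]).transform("sum")

print(toy)
```

On the first date the weights are 0.25 and 0.75, so the benchmark beta is 0.25 * 1.0 + 0.75 * 2.0 = 1.75; on the second date both weights are 0.5, giving 1.0.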

---
### Add winsorization pipeline

In [42]:
import class_backtester

pipept = Pipeline([
    ('ws', class_backtester.Winsorize(level_winsorize = 0.025)),
    ('qt', PowerTransformer()),
    ('maxabs', MaxAbsScaler())
])
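
The implementation of `class_backtester.Winsorize` is not shown in this notebook. As a rough idea of what such a pipeline step can look like, here is a hypothetical `QuantileClipper` transformer that learns per-column quantiles in `fit` and clips to them in `transform`. The class name, the `level` parameter, and the clipping rule are assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical winsorizer: clip each column to its [level, 1 - level] quantiles."""

    def __init__(self, level=0.025):
        self.level = level

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Bounds are learned on the fitting data only, one pair per column.
        self.lower_ = np.quantile(X, self.level, axis=0)
        self.upper_ = np.quantile(X, 1 - self.level, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)


# One column holding 0, 1, ..., 100.
X = np.arange(101, dtype=float).reshape(-1, 1)
clipped = QuantileClipper(level=0.025).fit_transform(X)

# The same step inside a pipeline, followed by scaling to [-1, 1].
pipe = Pipeline([("ws", QuantileClipper(level=0.025)), ("maxabs", MaxAbsScaler())])
scaled = pipe.fit_transform(X)
print(clipped.min(), clipped.max(), scaled.max())
```

Because the bounds are stored in `fit`, cloning the pipeline for each backtest window (as `clone(pipept)` does later) gives every window its own quantiles.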

---
### Define features used for our linear models (Lasso, Ridge, ElasticNet)

In [43]:
feature_names = list(training_df.columns)
list_to_remove = ["permno", "date", "secid"]
feature_names_final  = list(set(feature_names) - set(list_to_remove))

lag_features = [x for x in training_df.columns if 'lag' in x]
glb_features = [x for x in training_df.columns if 'glb' in x]
mfis_features = [x for x in training_df.columns if 'mfis' in x]
beta_original_features = [x for x in training_df.columns if '_original' in x]

# these columns are not predictors 
not_modeling_features = ['permno', 'date', 'secid', 'fret1d'] + beta_original_features

# these columns should not be winsorized
# glb and mfis features are already winsorized (0.025 level for each maturity on the monthly basis)
columns_to_not_preprocess = not_modeling_features + glb_features + mfis_features + lag_features

preprocess_features = list(set(training_df.columns) - set(columns_to_not_preprocess))
modeling_features = list(set(training_df.columns) - set(not_modeling_features))
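
One detail worth knowing about the set differences above: `list(set(a) - set(b))` does not preserve column order, and the order of a set of strings can differ between Python sessions. If a stable feature order matters for reproducibility, a list comprehension is a simple alternative. The column names below are a shortened, illustrative subset.

```python
# Illustrative subset of columns (assumption), in their DataFrame order.
all_columns = ["permno", "date", "secid", "fret1d", "ret", "lag_1", "glb3_D30", "mom12m"]

not_modeling = {"permno", "date", "secid", "fret1d"}
# Columns that are already winsorized upstream and should skip preprocessing.
already_winsorized = {
    c for c in all_columns if "lag" in c or "glb" in c or "mfis" in c
}

# Filtering with a comprehension keeps the original column order.
modeling = [c for c in all_columns if c not in not_modeling]
preprocess = [c for c in modeling if c not in already_winsorized]

print(modeling)
print(preprocess)
```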

---
### Define features used for our non-linear models (XGBoost, Random Forest)

In [44]:
feature_names_nonlinear = [
       'mean2w', 'lag_10', 'lag_7', 'lag_4', 'lag_8', 'ret', 'mfis91', 'std2w',
       'lag_17', 'idvar_ff4', 'glb3_D30', 'lag_19', 'lag_16', 'glb3_D91',
       'mom12m', 'lag_14', 'glb2_D30', 'lag_5', 'lag_21', 'rev1m_squared',
       'av_atmcall_squared', 'lag_9', 'av_atmcall_av_otmput', 'lag_13',
       'lag_20', 'beta_mom', 'lag_15', 'lag_6', 'lag_1', 'lag_11', 'mom6m',
       'lag_18', 'glb2_D91', 'av_atmcall_av_otmcall', 'lag_12', 'lag_2',
       'rev1m', 'lag_3', 'skew'
]

nonlinear = training_df[feature_names_nonlinear]

lag_features = [x for x in nonlinear.columns if 'lag' in x]
glb_features = [x for x in nonlinear.columns if 'glb' in x]
mfis_features = [x for x in nonlinear.columns if 'mfis' in x]
beta_original_features = [x for x in nonlinear.columns if '_original' in x]

# these columns are not predictors 
not_modeling_features = ['permno', 'date', 'secid', 'fret1d'] + beta_original_features

# these columns should not be winsorized
# glb and mfis features are already winsorized (0.025 level for each maturity on the monthly basis)
columns_to_not_preprocess = not_modeling_features + glb_features + mfis_features + lag_features

preprocess_features_nonlinear = list(set(nonlinear.columns) - set(columns_to_not_preprocess))
modeling_features_nonlinear = list(set(nonlinear.columns) - set(not_modeling_features))


In [45]:
training_df['date'] = pd.to_datetime(training_df['date'])

In [46]:
training_df.date.max()

Timestamp('2015-12-31 00:00:00')

---
### Optimizer 

In [47]:
def optimizer(trial): 
    '''
    Objective function for an Optuna study: each call evaluates one trial.
    It samples hyperparameters for the model selected by the global model_method,
    runs a backtest with them, and scores the predictions.
    The metric to minimize is the mean absolute error.
    
    :param trial: optuna.trial.Trial used to sample hyperparameters
    :return metric: mean absolute error of the model for the sampled hyperparameters
    '''

    # Suggest a range for each hyperparameter. Optuna then picks a value
    # within that range using its sampling algorithm.
    
    config_name = model_method
    
    test_config = { config_name: {'opt_function': "MVP",'alpha_estimation_method':model_method, 'parsing': clone(pipept)}}
    
    params = None 
    # XGboost Optimization 
    if model_method == "xgboost":   
        params = {} 
        #if torch.cuda.is_available():
        params["tree_method"] = "gpu_hist"
        params["gpu_id"] = 0

        # Fix parameters 
        params["verbosity"] = 0
        params["random_state"] = 0 
        params["objective"] = "reg:squarederror"  # current name of the deprecated "reg:linear"
        
        # Parameters to optimize 
        params["min_child_weight"] = trial.suggest_int('min_child_weight', 1, 6, log = True) 
        params["n_estimators"] = 100 # trial.suggest_int('n_estimators', 100, 125, log = True) 
        params["learning_rate"] = trial.suggest_float('learning_rate', 0.01, 0.3)
        params["alpha"] = trial.suggest_float('alpha', 1e-4, 0.01, log = True)
        params["max_depth"] = trial.suggest_int('max_depth', 4,15)
                
    elif model_method == "Lasso":
        
        alpha = trial.suggest_float('alpha',  0.1, 0.13, log = True)
        test_config[config_name]["alpha"] = alpha
    
    
    elif model_method == "ElasticNet":
        
        alpha = trial.suggest_float('alpha', 1e-4, 3, log = True)
        l1_ratio = trial.suggest_float('l1_ratio', 0.1, 0.9)
        test_config[config_name]["alpha"] = alpha
        test_config[config_name]["l1_ratio"] = l1_ratio
        
    
    elif model_method == "Ridge":
        
        alpha = trial.suggest_float('alpha', 0.00001, 3)
        test_config[config_name]["alpha"] = alpha
        
        
    from class_backtester import Backtester as bk 
    backtester = bk(
                        df = training_df, 
                        params = params,
                        optimise = False, # make it so that it can be equals to none 
                        preprocess_features = preprocess_features,
                        modeling_features = modeling_features,
                        rolling_frw='1D',
                        look_back_prm=252, 
                        configurations= test_config, 
                        col_to_pred='fret1d',
                        days_avoid_bias=1
                        )

    # rets, weights, predictions = backtester.run_backtest()
    # predictions = backtester.run_backtest()
    backtester.run_backtest()
    
    global resutls_df
    # retrieving the prediction results 
    resutls_df = backtester.dict_all_predictions[model_method][["date", "fret1d","fret1d_pred"]]
    

    global raw_dict
    raw_dict = backtester.dict_all_predictions
    
    y_pred = resutls_df["fret1d_pred"]
    y_test = resutls_df["fret1d"]

    # change later 
    metric = mean_absolute_error(y_test, y_pred)
    
    return metric
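
The function above reads `model_method` from a global variable, so the study cell has to set it before calling `optimize`. One alternative is a small factory that binds the model name into the objective. The sketch below is hypothetical: `make_objective`, `evaluate`, `MidpointTrial`, and `fake_evaluate` are illustrative names, and `MidpointTrial` is only a stand-in for an Optuna trial so the block runs without a study.

```python
def make_objective(model_method, evaluate):
    """Return an Optuna-style objective bound to one model_method.

    evaluate is a hypothetical callable (model_method, hyperparams) -> MAE,
    e.g. a wrapper around a backtest run.
    """
    def objective(trial):
        if model_method == "Ridge":
            hyperparams = {"alpha": trial.suggest_float("alpha", 1e-5, 3.0)}
        elif model_method == "ElasticNet":
            hyperparams = {
                "alpha": trial.suggest_float("alpha", 1e-4, 3.0, log=True),
                "l1_ratio": trial.suggest_float("l1_ratio", 0.1, 0.9),
            }
        else:
            raise ValueError(f"unsupported model_method: {model_method}")
        return evaluate(model_method, hyperparams)

    return objective


class MidpointTrial:
    """Stand-in for an Optuna trial: always returns the middle of the range."""

    def suggest_float(self, name, low, high, log=False):
        # Geometric midpoint on a log scale, arithmetic midpoint otherwise.
        return (low * high) ** 0.5 if log else (low + high) / 2


def fake_evaluate(model_method, hyperparams):
    # Placeholder score (assumption): distance of alpha from an arbitrary target.
    return abs(hyperparams["alpha"] - 0.04)


ridge_objective = make_objective("Ridge", fake_evaluate)
score = ridge_objective(MidpointTrial())
print(score)
```

With a real study, the call would look like `study.optimize(make_objective("Ridge", evaluate), n_trials=20)`, and no global needs to be set beforehand.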

---
### Ridge Regression Optimization 

In [58]:
# Create an Optuna study to execute many trials for optimization

model_method = "Ridge"

optuna_study = optuna.create_study(
    direction='minimize', 
    study_name="ridge_regression_optimization_final", 
    storage="sqlite:///optimizer_final.db", 
    load_if_exists=True)

In [None]:
# Start the optimization process. 
optuna_study.optimize(optimizer,
                      n_trials = 20
)

In [59]:
# Display the best trials within the study.
# The trials go into a separate DataFrame so that optuna_study remains a study object.
trials_df = optuna_study.trials_dataframe()
trials_df = trials_df.dropna()
trials_df.sort_values('value', ascending  = True, inplace = True)
trials_df.reset_index(inplace=True, drop=True)
trials_df.head(15)

| | number | value | datetime_start | datetime_complete | duration | params_alpha | state |
|---|---|---|---|---|---|---|---|
| 0 | 13 | 0.007882 | 2023-03-28 11:20:00.564842 | 2023-03-28 11:45:56.791048 | 0 days 00:25:56.226206 | 0.037649 | COMPLETE |
| 1 | 15 | 0.007896 | 2023-03-28 12:11:48.191793 | 2023-03-28 12:37:46.477824 | 0 days 00:25:58.286031 | 0.040813 | COMPLETE |
| 2 | 10 | 0.007954 | 2023-03-28 10:02:06.534027 | 2023-03-28 10:27:39.429277 | 0 days 00:25:32.895250 | 0.053717 | COMPLETE |
| 3 | 14 | 0.008545 | 2023-03-28 11:45:56.808016 | 2023-03-28 12:11:48.177290 | 0 days 00:25:51.369274 | 0.196062 | COMPLETE |
| 4 | 2 | 0.008604 | 2023-03-27 18:46:55.498811 | 2023-03-27 19:10:39.298610 | 0 days 00:23:43.799799 | 0.211933 | COMPLETE |
| 5 | 16 | 0.009957 | 2023-03-28 12:37:46.492495 | 2023-03-28 13:02:00.890412 | 0 days 00:24:14.397917 | 0.719459 | COMPLETE |
| 6 | 5 | 0.010744 | 2023-03-27 19:35:50.360733 | 2023-03-27 20:00:04.961218 | 0 days 00:24:14.600485 | 1.228309 | COMPLETE |
| 7 | 11 | 0.010966 | 2023-03-28 10:27:39.443855 | 2023-03-28 10:53:54.655543 | 0 days 00:26:15.211688 | 1.421775 | COMPLETE |
| 8 | 9 | 0.011108 | 2023-03-28 09:34:30.637765 | 2023-03-28 10:02:06.520527 | 0 days 00:27:35.882762 | 1.56061 | COMPLETE |
| 9 | 12 | 0.011139 | 2023-03-28 10:53:54.674542 | 2023-03-28 11:20:00.552339 | 0 days 00:26:05.877797 | 1.593149 | COMPLETE |


---
### Lasso Regression Optimization 

In [60]:
# Create an Optuna study to execute many trials for optimization
model_method = "Lasso"

optuna_study = optuna.create_study(
    direction='minimize', 
    study_name="lasso_regression_optimization", 
    storage="sqlite:///optimizer.db", 
    load_if_exists=True)

In [None]:
# Start the optimization process. 
optuna_study.optimize(optimizer,
                      n_trials = 20)

In [61]:
# Display the best trials within the study.
# The trials go into a separate DataFrame so that optuna_study remains a study object.
trials_df = optuna_study.trials_dataframe()
trials_df.sort_values('value', ascending  = True, inplace = True)
trials_df = trials_df.dropna()
trials_df.reset_index(inplace=True, drop=True)
trials_df.head(15)

| | number | value | datetime_start | datetime_complete | duration | params_alpha | state |
|---|---|---|---|---|---|---|---|
| 0 | 32 | 0.010906 | 2023-03-28 14:20:53.203877 | 2023-03-28 15:00:49.298787 | 0 days 00:39:56.094910 | 2.2e-05 | COMPLETE |
| 1 | 33 | 0.011576 | 2023-03-28 15:00:49.313437 | 2023-03-28 15:35:48.225945 | 0 days 00:34:58.912508 | 2.8e-05 | COMPLETE |
| 2 | 31 | 0.011974 | 2023-03-28 13:45:32.537534 | 2023-03-28 14:20:53.192377 | 0 days 00:35:20.654843 | 3.4e-05 | COMPLETE |
| 3 | 30 | 0.012134 | 2023-03-28 13:12:27.369455 | 2023-03-28 13:45:32.520317 | 0 days 00:33:05.150862 | 3.6e-05 | COMPLETE |
| 4 | 24 | 0.013791 | 2023-03-28 05:19:32.982263 | 2023-03-28 05:44:09.444654 | 0 days 00:24:36.462391 | 0.000104 | COMPLETE |
| 5 | 27 | 0.013806 | 2023-03-28 06:31:04.429903 | 2023-03-28 06:55:21.539266 | 0 days 00:24:17.109363 | 0.000105 | COMPLETE |
| 6 | 15 | 0.013822 | 2023-03-28 01:44:51.932618 | 2023-03-28 02:09:28.991451 | 0 days 00:24:37.058833 | 0.000106 | COMPLETE |
| 7 | 23 | 0.013848 | 2023-03-28 04:55:03.053004 | 2023-03-28 05:19:32.970754 | 0 days 00:24:29.917750 | 0.000108 | COMPLETE |
| 8 | 18 | 0.013881 | 2023-03-28 02:56:45.213068 | 2023-03-28 03:21:12.049742 | 0 days 00:24:26.836674 | 0.000111 | COMPLETE |
| 9 | 14 | 0.013882 | 2023-03-28 01:20:23.112221 | 2023-03-28 01:44:51.920434 | 0 days 00:24:28.808213 | 0.000111 | COMPLETE |


---
### ElasticNet Regression Optimization 

In [62]:
# Create an Optuna study to execute many trials for optimization
model_method = "ElasticNet"

optuna_study = optuna.create_study(
    direction='minimize', 
    study_name="elastic_net_regression_optimization_final_v5", 
    storage="sqlite:///optimizer.db", 
    load_if_exists=True)

In [None]:
optuna_study.optimize(optimizer,
                      n_trials = 20
                      )

In [64]:
# Display the best trials within the study.
# The trials go into a separate DataFrame so that optuna_study remains a study object.
trials_df = optuna_study.trials_dataframe()
trials_df.sort_values('value', ascending  = True, inplace = True)
trials_df = trials_df.dropna()
trials_df.reset_index(inplace=True, drop=True)
trials_df.head(15)

| | number | value | datetime_start | datetime_complete | duration | params_alpha | params_l1_ratio | state |
|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 0.013274 | 2023-03-30 21:49:25.101916 | 2023-03-30 22:20:26.394052 | 0 days 00:31:01.292136 | 0.000148 | 0.386985 | COMPLETE |
| 1 | 6 | 0.013419 | 2023-03-28 17:41:10.505550 | 2023-03-28 18:11:07.795357 | 0 days 00:29:57.289807 | 0.000237 | 0.22768 | COMPLETE |
| 2 | 12 | 0.01353 | 2023-03-28 20:14:26.985422 | 2023-03-28 20:41:36.401665 | 0 days 00:27:09.416243 | 0.000189 | 0.357275 | COMPLETE |
| 3 | 17 | 0.013544 | 2023-03-30 22:20:26.412096 | 2023-03-30 22:51:58.161426 | 0 days 00:31:31.749330 | 0.000178 | 0.3941 | COMPLETE |
| 4 | 15 | 0.013839 | 2023-03-30 21:19:08.251847 | 2023-03-30 21:49:25.084708 | 0 days 00:30:16.832861 | 0.000252 | 0.355639 | COMPLETE |
| 5 | 11 | 0.014102 | 2023-03-28 19:49:12.504299 | 2023-03-28 20:14:26.970570 | 0 days 00:25:14.466271 | 0.000516 | 0.227055 | COMPLETE |
| 6 | 10 | 0.01428 | 2023-03-28 19:25:25.915904 | 2023-03-28 19:49:12.490798 | 0 days 00:23:46.574894 | 0.000567 | 0.817802 | COMPLETE |
| 7 | 7 | 0.01433 | 2023-03-28 18:11:07.813745 | 2023-03-28 18:36:21.550745 | 0 days 00:25:13.737000 | 0.000875 | 0.665226 | COMPLETE |
| 8 | 18 | 0.014528 | 2023-03-30 22:51:58.176071 | 2023-03-30 23:19:49.929260 | 0 days 00:27:51.753189 | 0.002136 | 0.453856 | COMPLETE |
| 9 | 9 | 0.014536 | 2023-03-28 19:01:11.339280 | 2023-03-28 19:25:25.902793 | 0 days 00:24:14.563513 | 0.006722 | 0.141187 | COMPLETE |


---
### Final Results and Configurations

After extensive model optimization, we select the best hyperparameters found for each model and pair them with the default values. Comparing the two shows whether the optimization improved not only predictive power but also returns. 

In [None]:
msr_final_configs = { 
    # MSR Trading Strategy: 
    # Lasso and Ridge alphas are the params_alpha values of the best trial in each study.
    'msr_lasso': {'opt_function': MSR, 'alpha_estimation_method':'Lasso', "alpha": 2.2e-05, 'parsing': clone(pipept)}, # best
    'msr_lasso_default': {'opt_function': MSR, 'alpha_estimation_method':'Lasso', "alpha": 1.0, 'parsing': clone(pipept)},
    'msr_ridge': {'opt_function': MSR, 'alpha_estimation_method':'Ridge', "alpha": 0.037649, 'parsing': clone(pipept)},
    'msr_ridge_default': {'opt_function': MSR, 'alpha_estimation_method':'Ridge', "alpha":  1.0, 'parsing': clone(pipept)},
    'msr_ElasticNet_opt': {'opt_function': class_backtester.MSR, 'alpha_estimation_method':'ElasticNet', "alpha": 0.000237, "l1_ratio":  0.227680, 'parsing': clone(pipept)},
    'msr_ElasticNet_default': {'opt_function': class_backtester.MSR, 'alpha_estimation_method':'ElasticNet', "alpha": 1.0, "l1_ratio": 0.1, "parsing": clone(pipept)}
}

mpv_final_configs = { 
    # MVP Trading Strategy: 
    'mvp_lasso': {'opt_function': MVP, 'alpha_estimation_method':'Lasso', "alpha": 2.2e-05, 'parsing': clone(pipept)},
    'mvp_lasso_default': {'opt_function': MVP, 'alpha_estimation_method':'Lasso', "alpha": 1.0, 'parsing': clone(pipept)},
    'mvp_ridge': {'opt_function': MVP, 'alpha_estimation_method':'Ridge', "alpha": 0.037649, 'parsing': clone(pipept)},
    'mvp_ridge_default': {'opt_function': MVP, 'alpha_estimation_method':'Ridge', "alpha":  1.0, 'parsing': clone(pipept)}
    } 

ls_final_configs = { 
    # Long Short Trading Strategy: both configurations use optimized parameters
    'ls_lasso': {'opt_function': LS, 'alpha_estimation_method':'Lasso', "alpha": 2.2e-05, 'parsing': clone(pipept)},
    'ls_ridge': {'opt_function': LS, 'alpha_estimation_method':'Ridge', "alpha": 0.037649, 'parsing': clone(pipept)}
    }
# The XGBoost and random forest models require a different configuration because we will use a different
# set of features for these particular models.
xgboost_rforest_configs = {
    # MSR & MVP & LS Trading Strategy for XGBoost and Random Forest:
    'msr_xgboost': {'opt_function': MSR, 'alpha_estimation_method': 'xgboost', 'parsing': clone(pipept)},
    'msr_rforest': {'opt_function': MSR, 'alpha_estimation_method': 'random_forest', 'parsing': clone(pipept)},
    'mvp_xgboost': {'opt_function': MVP, 'alpha_estimation_method': 'xgboost', 'parsing': clone(pipept)},
    'ls_xgboost': {'opt_function': LS, 'alpha_estimation_method': 'xgboost', 'parsing': clone(pipept)}
}
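
Writing each configuration by hand makes it easy for a value to end up in the wrong entry. One possible alternative is to generate the dictionaries from a table of strategies and a table of tuned hyperparameters. The helper below is a hypothetical sketch: `build_configs` and `make_parsing` are illustrative names, the strategy values are string placeholders for the real optimization functions, and `dict` stands in for `clone(pipept)`.

```python
from itertools import product


def build_configs(strategies, tuned_params, make_parsing):
    """Build one configuration per (strategy, model) pair.

    strategies:   tag -> optimization function (placeholders here)
    tuned_params: model name -> hyperparameters chosen from the study
    make_parsing: zero-argument callable returning a fresh preprocessing object
    """
    configs = {}
    for (tag, opt_function), (method, params) in product(
        strategies.items(), tuned_params.items()
    ):
        configs[f"{tag}_{method.lower()}"] = {
            "opt_function": opt_function,
            "alpha_estimation_method": method,
            **params,
            "parsing": make_parsing(),  # a new object for every configuration
        }
    return configs


# Placeholders (assumption): strings instead of the real MSR / MVP functions.
strategies = {"msr": "MSR", "mvp": "MVP"}
tuned_params = {
    "Ridge": {"alpha": 0.037649},
    "ElasticNet": {"alpha": 0.000237, "l1_ratio": 0.22768},
}

configs = build_configs(strategies, tuned_params, make_parsing=dict)
for name, cfg in configs.items():
    print(name, cfg)
```

In the notebook, `make_parsing` could be `lambda: clone(pipept)`, so every configuration still receives its own unfitted pipeline.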
