# Bitfarms At-home Task

- Submit two .ipynb workbooks – one for each strategy.
- Submit the cumulative returns for each strategy – as csv or json.

In [1]:
# import libraries
import pandas as pd
import numpy as np
import talib as ta
from datetime import datetime, timedelta
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
import warnings
import os
import glob
warnings.simplefilter(action='ignore', category=FutureWarning)

## Split Dataset
- Split each dataset into in-sample and out-of-sample using 1st Jan 2023 as the split point.

In [2]:
# Load the datasets
btc_hist = pd.read_csv('btc-hist.csv')

# Convert the 'time' column to datetime format
btc_hist['time'] = pd.to_datetime(btc_hist['time'])

# Set the 'time' column as the index
btc_hist = btc_hist.set_index('time')

# Split the datasets into in-sample and out-of-sample data
split_date =pd.to_datetime('2023-01-01')
btc_in_sample = btc_hist[:split_date]
btc_out_sample = btc_hist[split_date:]

# Save the datasets to parquet files
btc_in_sample.to_parquet('btc_in_sample.parquet')
btc_out_sample.to_parquet('btc_out_sample.parquet')

# Load the datasets from parquet files
btc_in_sample = pd.read_parquet('btc_in_sample.parquet')
btc_out_sample = pd.read_parquet('btc_out_sample.parquet')

## In-Sample development and testing 
1. Create an array of lookback periods from 5 to 90 inclusive in increments of 5.
2. For each lookback period, fit each of the following volatility indicators:
- Bollinger Bands
- Keltner Channel
- Donchian Channel

    <i>Note: Besides lookback period, you will need to provide other parameters that are unique to each indicator – use discretion but be sensible.</i>

3. For each combination of indicator and lookback, calculate the following:
- Distance from closing price to lower band.
- Distance from closing price to upper band.
- Distance between upper and lower band (channel breadth).

    <i>Note: it your choice whether you use price or some transformation such as log price.</i>

In [3]:
# Function to compute Bollinger Bands
def bollinger_bands(df, lookback):
    df['MA'] = df['close'].rolling(window=lookback).mean()
    df['BB_Upper'] = df['MA'] + 2 * df['close'].rolling(window=lookback).std()
    df['BB_Lower'] = df['MA'] - 2 * df['close'].rolling(window=lookback).std()
    return df

# Function to compute Keltner Channel
def keltner_channel(df, lookback):
    df['TR'] = ta.TRANGE(df['high'], df['low'], df['close'])
    df['ATR'] = df['TR'].rolling(window=lookback).mean()
    df['KC_Middle'] = df['close'].rolling(window=lookback).mean()
    df['KC_Upper'] = df['KC_Middle'] + 2 * df['ATR']
    df['KC_Lower'] = df['KC_Middle'] - 2 * df['ATR']
    return df

# Function to compute Donchian Channel
def donchian_channel(df, lookback):
    df['Don_Upper'] = df['high'].rolling(window=lookback).max()
    df['Don_Lower'] = df['low'].rolling(window=lookback).min()
    return df

# Array of lookback periods
lookback_periods = list(range(5, 91, 5))

# List to store features
features = []

# Generate features for each lookback period
for lookback in lookback_periods:
    btc_features = bollinger_bands(btc_in_sample.copy(), lookback)
    btc_features = keltner_channel(btc_features, lookback)
    btc_features = donchian_channel(btc_features, lookback)
    
    btc_features['BB_Distance_Lower'] = (btc_features['close'] - btc_features['BB_Lower']) / btc_features['close']
    btc_features['BB_Distance_Upper'] = (btc_features['BB_Upper'] - btc_features['close']) / btc_features['close']
    btc_features['BB_Breadth'] = (btc_features['BB_Upper'] - btc_features['BB_Lower']) / btc_features['close']
    
    btc_features['KC_Distance_Lower'] = (btc_features['close'] - btc_features['KC_Lower']) / btc_features['close']
    btc_features['KC_Distance_Upper'] = (btc_features['KC_Upper'] - btc_features['close']) / btc_features['close']
    btc_features['KC_Breadth'] = (btc_features['KC_Upper'] - btc_features['KC_Lower']) / btc_features['close']
    
    btc_features['Don_Distance_Lower'] = (btc_features['close'] - btc_features['Don_Lower']) / btc_features['close']
    btc_features['Don_Distance_Upper'] = (btc_features['Don_Upper'] - btc_features['close']) / btc_features['close']
    btc_features['Don_Breadth'] = (btc_features['Don_Upper'] - btc_features['Don_Lower']) / btc_features['close']
    
    features.append((lookback, btc_features))

# Save the features to csv files for future reference
for lookback, btc_features in features:
    btc_features.to_csv(f'btc_features_{lookback}.csv')

## Scale and Normalize 
4. From step 3, you will have generated a set of features. Scale and normalized these features so that they are bounded and stationary and that these conditions remain true even during intervals of elevated volatility. Perform other treatments as you see fit to enhance the signal fidelity of these features (i.e. ability to explain variations).

In [4]:
# Function to scale and normalize features
def scale_features(df, feature_cols):
    scaler = StandardScaler()
    df[feature_cols] = scaler.fit_transform(df[feature_cols])
    return df

# List of feature columns
feature_cols = [
    'BB_Distance_Lower', 'BB_Distance_Upper', 'BB_Breadth',
    'KC_Distance_Lower', 'KC_Distance_Upper', 'KC_Breadth',
    'Don_Distance_Lower', 'Don_Distance_Upper', 'Don_Breadth'
]

# Scale and normalize the features for each lookback period
scaled_features = []
for lookback, btc_features, in features:
    btc_features_scaled = scale_features(btc_features.copy(), feature_cols)
    scaled_features.append((lookback, btc_features_scaled))


## Develop Trading Strategies using PCA and Random Forest
5. Create two systematic trading strategies based on these features.
- Using PCA
- Using Random Forest

### PCA-based trading strategy

In [5]:
# Function to create PCA-based trading strategy
def pca_strategy(df, feature_cols, n_components=2):
    # Fill in missing values with mean
    imputer = SimpleImputer(strategy='mean')
    df[feature_cols] = imputer.fit_transform(df[feature_cols])
    pca = PCA(n_components=n_components)
    pca_features = pca.fit_transform(df[feature_cols])
    
    # Simple threshold-based trading strategy
    df['PCA_Signal'] = np.where(pca_features[:, 0] > 0, 1, -1)
    df['PCA_Returns'] = df['close'].pct_change().shift(-1) * df['PCA_Signal']
    df['Cumulative_PCA_Returns'] = (1 + df['PCA_Returns']).cumprod() - 1
    
    return df

# Apply PCA strategy to each lookback period
pca_strategies = []
for lookback, btc_features_scaled in scaled_features:
    btc_pca = pca_strategy(btc_features_scaled.copy(), feature_cols)
    pca_strategies.append((lookback, btc_pca))

# Save cumulative returns for PCA strategy
for lookback, btc_pca in pca_strategies:
    btc_pca[['Cumulative_PCA_Returns']].to_csv(f'btc_pca_returns_{lookback}.csv', index=False)
    


### Random Forest based trading strategy

In [6]:
# Function to create Random Forest-based trading strategy
def random_forest_strategy(df, feature_cols):
    df['Target'] = np.where(df['close'].shift(-1) > df['close'], 1, 0)
    X = df[feature_cols].values
    y = df['Target'].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
    
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    df['RF_Signal'] = rf.predict(X)
    df['RF_Returns'] = df['close'].pct_change().shift(-1) * (df['RF_Signal'] * 2 - 1)
    df['Cumulative_RF_Returns'] = (1 + df['RF_Returns']).cumprod() - 1
    
    return df

# Apply Random Forest strategy to each lookback period
rf_strategies = []
for lookback, btc_features_scaled in scaled_features:
    btc_rf = random_forest_strategy(btc_features_scaled.copy(), feature_cols)
    rf_strategies.append((lookback, btc_rf))

# Save cumulative returns for Random Forest strategy
for lookback, btc_rf in rf_strategies:
    btc_rf[['Cumulative_RF_Returns']].to_csv(f'btc_rf_returns_{lookback}.csv', index=False)


## Measure Performance Metrics
6. Measure the Average Win Rate (%), Number of Trades, Sharpe, Ulcer Performance Index and granular Profit Factor for each strategy. By ‘granular’ Profit Factor, you need to ensure that the returns used in that calculation are based on hourly mark to market as per the frequency of the data provided.

In [7]:
# function for performance metrics
def calculate_performance_metrics(df, strategy_col):
    # Calculate basic metrics
    total_return = df[strategy_col].iloc[-1]
    annualized_return = (1 + total_return) ** (252 / len(df)) - 1
    annualized_volatility = df[strategy_col].pct_change().std() * np.sqrt(252)
    sharpe_ratio = annualized_return / annualized_volatility
    
    # Calculate drawdowns
    df['Drawdown'] = df[strategy_col] - df[strategy_col].cummax()
    max_drawdown = df['Drawdown'].min()

    # Calculate Average Win Rate Percentage
    df['Returns'] = df[strategy_col].pct_change()
    df['Wins'] = df['Returns'] > 0
    avg_win_rate = df['Wins'].mean() * 100
    
    # Calculate Number of Trades
    number_of_trades = df['Returns'].count()

    # Calculate Ulcer Performance Index
    df['Drawdown_Pct'] = df['Drawdown'] / df[strategy_col].cummax()
    ulcer_index = np.sqrt(np.mean(df['Drawdown_Pct'] ** 2))
    ulcer_performance_index = annualized_return / ulcer_index if ulcer_index != 0 else np.nan

    # Calculate Profit Factor
    total_gains = df[df['Returns'] > 0]['Returns'].sum()
    total_losses = abs(df[df['Returns'] < 0]['Returns'].sum())
    profit_factor = total_gains / total_losses if total_losses != 0 else np.nan

    return total_return, annualized_return, annualized_volatility, sharpe_ratio, max_drawdown, avg_win_rate, number_of_trades, ulcer_performance_index, profit_factor

# Calculate performance metrics for each strategy
pca_performance = []
rf_performance = []

for lookback, btc_pca in pca_strategies:
    pca_metrics_btc = calculate_performance_metrics(btc_pca, 'Cumulative_PCA_Returns')
    pca_performance.append((lookback, 'BTC', *pca_metrics_btc))
    
for lookback, btc_rf in rf_strategies:
    rf_metrics_btc = calculate_performance_metrics(btc_rf, 'Cumulative_RF_Returns')
    rf_performance.append((lookback, 'BTC', *rf_metrics_btc))

# Save performance metrics to CSV
pca_performance_df = pd.DataFrame(pca_performance, columns=['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                            'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                            'Profit_Factor'])
rf_performance_df = pd.DataFrame(rf_performance, columns=['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                          'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                          'Profit_Factor'])

pca_performance_df.to_csv('pca_performance.csv', index=False)
rf_performance_df.to_csv('rf_performance.csv', index=False)

## Monte Carlo Permutation Test
7. Perform a Monte Carlo Permutation Test (1000 repetitions) using the in-sample data to test each strategy. For each permutation run, calculate the Average Win Rate (%), Number of Trades, Sharpe, Ulcer Performance Index and granular Profit Factor. Average these measures across the 1000 repetitions and come these to your strategy’s performance on the actual in-sample data.

In [8]:
# monte carlo
def monte_carlo_permutation_test(df, strategy_col, n_iterations=1000):
    metrics = []
    for _ in range(n_iterations):
        permuted_returns = df[strategy_col].sample(frac=1, replace=False).reset_index(drop=True)
        df['Permuted_Returns'] = permuted_returns.cumsum()
        metrics.append(calculate_performance_metrics(df, 'Permuted_Returns'))
    
    return np.mean(metrics, axis=0), np.std(metrics, axis=0)

# Perform Monte Carlo Permutation Test for each strategy
pca_monte_carlo = []
rf_monte_carlo = []
for lookback, btc_pca in pca_strategies:
    pca_metrics, pca_std = monte_carlo_permutation_test(btc_pca, 'Cumulative_PCA_Returns')
    pca_monte_carlo.append((lookback, 'BTC', *pca_metrics, *pca_std))
    
for lookback, btc_rf in rf_strategies:
    rf_metrics, rf_std = monte_carlo_permutation_test(btc_rf, 'Cumulative_RF_Returns')
    rf_monte_carlo.append((lookback, 'BTC', *rf_metrics, *rf_std))

# Save Monte Carlo results to CSV
pca_monte_carlo_df = pd.DataFrame(pca_monte_carlo, columns=['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                            'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                            'Profit_Factor', 'Total_Return_Std', 'Annualized_Return_Std', 'Annualized_Volatility_Std',
                                                            'Sharpe_Ratio_Std', 'Max_Drawdown_Std', 'Avg_Win_Rate_Std', 'Number_of_Trades_Std',
                                                            'Ulcer_Performance_Index_Std', 'Profit_Factor_Std'])
rf_monte_carlo_df = pd.DataFrame(rf_monte_carlo, columns = ['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                            'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                            'Profit_Factor', 'Total_Return_Std', 'Annualized_Return_Std', 'Annualized_Volatility_Std',
                                                            'Sharpe_Ratio_Std', 'Max_Drawdown_Std', 'Avg_Win_Rate_Std', 'Number_of_Trades_Std',
                                                            'Ulcer_Performance_Index_Std', 'Profit_Factor_Std'])

pca_monte_carlo_df.to_csv('pca_monte_carlo.csv', index=False)
rf_monte_carlo_df.to_csv('rf_monte_carlo.csv', index=False)


## Validate Strategies on Out-of-Sample Data
8. Run each strategy on the out of sample data and calculate the performance metrics. Perform the same Monte Carlo Permutation Test on the OOS data and come the performance of permuted runs against performance on actual OOS data.

In [18]:
# Apply PCA and Random Forest strategies to out-of-sample data
out_sample_pca_strategies = []
out_sample_rf_strategies = []
for lookback, _ in scaled_features:
    btc_out_sample_features = pd.read_csv(f'btc_features_{lookback}.csv')
    btc_out_sample_pca = pca_strategy(btc_out_sample_features.copy(), feature_cols)
    out_sample_pca_strategies.append((lookback, btc_out_sample_pca))
    
    btc_out_sample_rf = random_forest_strategy(btc_out_sample_features.copy(), feature_cols)
    out_sample_rf_strategies.append((lookback, btc_out_sample_rf))

# Calculate performance metrics for out-of-sample data
out_sample_pca_performance = []
out_sample_rf_performance = []
for lookback, btc_pca in out_sample_pca_strategies:
    pca_metrics = calculate_performance_metrics(btc_pca, 'Cumulative_PCA_Returns')
    out_sample_pca_performance.append((lookback, 'BTC', *pca_metrics))
    
for lookback, btc_rf in out_sample_rf_strategies:
    rf_metrics = calculate_performance_metrics(btc_rf, 'Cumulative_RF_Returns')
    out_sample_rf_performance.append((lookback, 'BTC', *rf_metrics))

# Save out-of-sample performance metrics to CSV
out_sample_pca_performance_df = pd.DataFrame(out_sample_pca_performance, columns =['Lookback', 'Asset', 'Total_Return', 'Annualized_Return',
                                                                                   'Annualized_Volatility', 'Sharpe_Ratio', 'Max_Drawdown',
                                                                                   'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                                                   'Profit_Factor'])
out_sample_rf_performance_df = pd.DataFrame(out_sample_rf_performance, columns = ['Lookback', 'Asset', 'Total_Return', 'Annualized_Return',
                                                                                  'Annualized_Volatility', 'Sharpe_Ratio', 'Max_Drawdown',
                                                                                  'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                                                  'Profit_Factor'])

out_sample_pca_performance_df.to_csv('out_sample_pca_performance.csv', index=False)
out_sample_rf_performance_df.to_csv('out_sample_rf_performance.csv', index=False)

### Perform Monte Carlos on Out-of-Sample Data

In [10]:
# monte carlo
def monte_carlo_permutation_test(df, strategy_col, n_iterations=1000):
    metrics = []
    for _ in range(n_iterations):
        permuted_returns = df[strategy_col].sample(frac=1, replace=False).reset_index(drop=True)
        df['Permuted_Returns'] = permuted_returns.cumsum()
        metrics.append(calculate_performance_metrics(df, 'Permuted_Returns'))
    
    return np.mean(metrics, axis=0), np.std(metrics, axis=0)

# Perform Monte Carlo Permutation Test for each strategy
pca_monte_carlo = []
rf_monte_carlo = []
for lookback, btc_pca in pca_strategies:
    pca_metrics, pca_std = monte_carlo_permutation_test(btc_pca, 'Cumulative_PCA_Returns')
    pca_monte_carlo.append((lookback, 'BTC', *pca_metrics, *pca_std))
    
for lookback, btc_rf in rf_strategies:
    rf_metrics, rf_std = monte_carlo_permutation_test(btc_rf, 'Cumulative_RF_Returns')
    rf_monte_carlo.append((lookback, 'BTC', *rf_metrics, *rf_std))

# Save Monte Carlo results to CSV
pca_monte_carlo_dfo = pd.DataFrame(pca_monte_carlo, columns=['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                             'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                             'Profit_Factor', 'Total_Return_Std', 'Annualized_Return_Std', 'Annualized_Volatility_Std',
                                                             'Sharpe_Ratio_Std', 'Max_Drawdown_Std', 'Avg_Win_Rate_Std', 'Number_of_Trades_Std',
                                                             'Ulcer_Performance_Index_Std', 'Profit_Factor_Std'])
rf_monte_carlo_dfo = pd.DataFrame(rf_monte_carlo, columns = ['Lookback', 'Asset', 'Total_Return', 'Annualized_Return', 'Annualized_Volatility',
                                                             'Sharpe_Ratio', 'Max_Drawdown', 'Avg_Win_Rate', 'Number_of_Trades', 'Ulcer_Performance_Index',
                                                             'Profit_Factor', 'Total_Return_Std', 'Annualized_Return_Std', 'Annualized_Volatility_Std',
                                                             'Sharpe_Ratio_Std', 'Max_Drawdown_Std', 'Avg_Win_Rate_Std', 'Number_of_Trades_Std',
                                                             'Ulcer_Performance_Index_Std', 'Profit_Factor_Std'])

pca_monte_carlo_dfo.to_csv('pca_monte_carlo_o.csv', index=False)
rf_monte_carlo_dfo.to_csv('rf_monte_carlo_o.csv', index=False)

## Apply Meta-Labelling
9. Apply meta-labelling to each strategy and demonstrate the extent to which the strategy’s performance is enhanced through that corrective exercise.

In [12]:
# Meta-labelling: Use the results of the primary model as features for a secondary model
def apply_meta_labelling(df, primary_strategy_col, feature_cols):
    df['Primary_Signal'] = np.where(df[primary_strategy_col].shift(-1) > df[primary_strategy_col], 1, 0)
    X = df[feature_cols + ['Primary_Signal']].values
    y = np.where(df['close'].shift(-1) > df['close'], 1, 0)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
    
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    df['Meta_Signal'] = rf.predict(X)
    df['Meta_Returns'] = df['close'].pct_change().shift(-1) * (df['Meta_Signal'] * 2 - 1)
    df['Cumulative_Meta_Returns'] = (1 + df['Meta_Returns']).cumprod() - 1
    
    return df

# Apply meta-labelling to each strategy
meta_labelled_strategies = []
for lookback, btc_pca in pca_strategies:
    btc_meta = apply_meta_labelling(btc_pca.copy(), 'Cumulative_PCA_Returns', feature_cols)
    meta_labelled_strategies.append((lookback, btc_meta))

for lookback, btc_rf in rf_strategies:
    btc_meta = apply_meta_labelling(btc_rf.copy(), 'Cumulative_RF_Returns', feature_cols)
    meta_labelled_strategies.append((lookback, btc_meta))

# Save cumulative returns for meta-labelled strategy
for lookback, btc_meta in meta_labelled_strategies:
    btc_meta[['Cumulative_Meta_Returns']].to_csv(f'btc_meta_returns_{lookback}.csv', index=False)

### Compare In-Sample to Out-of Sample and Meta-Labelled returns

In [13]:
# Set the directory path
directory = "/Users/qboy/Downloads/bitf/"
# Use glob to match the pattern 'btc_pca_returns*'
pattern = os.path.join(directory, "btc_pca_returns*.csv")
# List to store individual dataframes
dfs = []
# Loop through the CSV files and append to the list
for file in glob.glob(pattern):
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes in the list
combined_df = pd.concat(dfs, ignore_index=True)
# Save the combined dataframe to a new CSV file
combined_df.to_csv(os.path.join(directory, "combined_btc_pca_returns.csv"), index=False)

# Use glob to match the pattern 'btc_rf_returns*'
pattern = os.path.join(directory, "btc_rf_returns*.csv")
# List to store individual dataframes
dfs = []
# Loop through the CSV files and append to the list
for file in glob.glob(pattern):
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes in the list
combined_df = pd.concat(dfs, ignore_index=True)
# Save the combined dataframe to a new CSV file
combined_df.to_csv(os.path.join(directory, "combined_btc_rf_returns.csv"), index=False)

# Use glob to match the pattern 'btc_meta_returns*'
pattern = os.path.join(directory, "btc_meta_returns*.csv")
# List to store individual dataframes
dfs = []
# Loop through the CSV files and append to the list
for file in glob.glob(pattern):
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all dataframes in the list
combined_df = pd.concat(dfs, ignore_index=True)
# Save the combined dataframe to a new CSV file
combined_df.to_csv(os.path.join(directory, "combined_btc_meta_returns.csv"), index=False)

# read the combined dataframes
combined_pca = pd.read_csv('combined_btc_pca_returns.csv')
combined_rf = pd.read_csv('combined_btc_rf_returns.csv')
combined_meta = pd.read_csv('combined_btc_meta_returns.csv')

In [16]:
# read out of sample data
out_sample_pca_performance = pd.read_csv('out_sample_pca_performance.csv')
out_sample_rf_performance = pd.read_csv('out_sample_rf_performance.csv')
# compare the average return for each dataset 
print(f'Average Return for In-Sample PCA Strategy is {combined_pca.Cumulative_PCA_Returns.mean()}%.')
print(f'Average Return for In-Sample RF Strategy is {combined_rf.Cumulative_RF_Returns.mean()}%.')
print(f'Average Return for Out-of-Sample PCA Strategy is {out_sample_pca_performance.Total_Return.mean()}%.')
print(f'Average Return for Out-of-Sample RF Strategy is {out_sample_rf_performance.Total_Return.mean()}%.')
print(f'Average Return for Meta Label Strategy is {combined_meta.Cumulative_Meta_Returns.mean()}%.')

Average Return for In-Sample PCA Strategy is -0.9222181652809942%.
Average Return for In-Sample RF Strategy is 1.559678924405005e+166%.
Average Return for Out-of-Sample PCA Strategy is nan%.
Average Return for Out-of-Sample RF Strategy is nan%.
Average Return for Meta Label Strategy is 7.763465679934562e+166%.


## Reflect on Limitations and Biases
10. Critically comment on the limitations and biases (both systematic and idiosyncratic) that pertain to each trading strategy – one that used PCA, and the other the used RF. Be as precise with your mathematical reasoning.
11. Critically comment on the limitations and biases (both systematic and idiosyncratic) that pertain to the use of meta-labelling to improve system performance. 

### PCA-Based Strategy: Limitations and Biases

#### **Systematic Limitations and Biases**

1. **Linear Assumptions:**
   - **Description:** Principal Component Analysis (PCA) assumes that the underlying data relationships are linear.
   - **Mathematical Reasoning:** PCA identifies the directions (principal components) that maximize variance under the constraint of orthogonality and linearity. This assumption might lead to suboptimal feature extraction if the true relationships between variables are nonlinear.
   - **Impact:** In financial time series, where nonlinear dynamics are common, PCA might overlook important interactions between features, leading to suboptimal trading signals.

2. **Stationarity Requirements:**
   - **Description:** PCA assumes that the input data is stationary (mean and variance are constant over time).
   - **Mathematical Reasoning:** The covariance matrix, which PCA relies on, is a function of the means and variances of the input features. Non-stationary data may distort the principal components, leading to unstable and unreliable results.
   - **Impact:** Financial markets are often non-stationary, with changing volatilities and trends. PCA might yield components that are only relevant for specific periods, leading to overfitting and poor out-of-sample performance.

3. **Variance Maximization:**
   - **Description:** PCA maximizes the variance captured by each principal component, implicitly assuming that higher variance equates to more important information.
   - **Mathematical Reasoning:** The first principal component explains the most variance, but this doesn't necessarily correspond to the most predictive or relevant feature for trading.
   - **Impact:** In trading, high variance might capture noise or market shocks rather than consistent signals. This bias towards variance can lead to strategies that are overly sensitive to volatile periods, resulting in erratic trading performance.

#### **Idiosyncratic Limitations and Biases**

1. **Sensitivity to Outliers:**
   - **Description:** PCA is sensitive to outliers in the data.
   - **Mathematical Reasoning:** Outliers can disproportionately affect the covariance matrix, distorting the principal components and leading to misleading conclusions about the data's structure.
   - **Impact:** In financial markets, outliers often occur due to sudden news or market events. PCA-based strategies may react strongly to these outliers, generating signals that are not reflective of underlying market trends.

2. **Fixed Components:**
   - **Description:** Once determined, PCA components are fixed and do not adapt to changing market conditions.
   - **Mathematical Reasoning:** PCA decomposes the data into a set of orthogonal vectors (components) based on historical data. These components do not change unless PCA is re-run.
   - **Impact:** As market dynamics evolve, the relevance of these fixed components may diminish, leading to outdated signals and deteriorating strategy performance.

### Random Forest-Based Strategy: Limitations and Biases

#### **Systematic Limitations and Biases**

1. **Overfitting:**
   - **Description:** Random Forests can easily overfit to the in-sample data.
   - **Mathematical Reasoning:** Random Forests are composed of many decision trees, each built on random subsets of features and data points. While this reduces variance, it can also lead to models that perfectly capture noise in the training data, especially if the number of trees is high or the depth of the trees is unrestricted.
   - **Impact:** Overfitting results in a model that performs well in-sample but poorly out-of-sample, as it fails to generalize to new data. This is particularly problematic in financial markets where conditions change frequently.

2. **Feature Importance Bias:**
   - **Description:** Random Forests can be biased towards features with more variability or a larger range.
   - **Mathematical Reasoning:** In Random Forests, features that split the data into more homogeneous groups are considered more important. However, features with greater variability or a wider range of values are more likely to create such splits, even if they are not the most predictive.
   - **Impact:** This can lead to a model that overemphasizes certain features, potentially ignoring more subtle but significant signals. The strategy might thus focus on noise rather than on meaningful trends.

3. **Non-Stationarity Issues:**
   - **Description:** Random Forests do not explicitly handle non-stationarity in the data.
   - **Mathematical Reasoning:** The model assumes that the relationships learned during training remain valid over time. However, in financial markets, these relationships can change due to shifts in market regimes, leading to a mismatch between training and live data.
   - **Impact:** This can result in significant degradation of performance when applied to out-of-sample data, as the model may be unable to adapt to new market conditions.

#### **Idiosyncratic Limitations and Biases**

1. **Complexity and Interpretability:**
   - **Description:** Random Forests, while powerful, are complex and difficult to interpret.
   - **Mathematical Reasoning:** A Random Forest is an ensemble of decision trees, each making a prediction. The final prediction is an average or majority vote, making it challenging to understand the contribution of individual features or trees.
   - **Impact:** This lack of transparency can be a drawback in trading, where understanding the rationale behind a model's decisions is crucial for risk management and strategy refinement.

2. **Sensitivity to Training Data:**
   - **Description:** The model's performance is highly sensitive to the quality and quantity of the training data.
   - **Mathematical Reasoning:** Random Forests are data-hungry and perform best when provided with large, diverse datasets. If the training data is not representative of future market conditions, the model's predictions will be unreliable.
   - **Impact:** This reliance on extensive historical data can be problematic in markets with limited data availability or where historical patterns do not repeat, leading to suboptimal trading decisions.

### Meta-Labelling: Limitations and Biases

#### **Systematic Limitations and Biases**

1. **Increased Model Complexity:**
   - **Description:** Meta-labelling adds an additional layer of complexity to the trading strategy.
   - **Mathematical Reasoning:** Meta-labelling involves training a secondary model that takes the output of the primary model as input features. This increases the dimensionality and complexity of the model, potentially leading to overfitting, especially if the meta-model is not sufficiently regularized.
   - **Impact:** Higher complexity can result in a strategy that is more prone to overfitting and harder to interpret. This may reduce the robustness of the strategy when applied to new data.

2. **Dependency on Primary Model:**
   - **Description:** The effectiveness of meta-labelling is heavily dependent on the quality of the primary model's predictions.
   - **Mathematical Reasoning:** If the primary model's signals are weak or noisy, the meta-label will inherit and possibly amplify these issues. The meta-model may also become overfitted to the idiosyncrasies of the primary model, rather than providing independent, corrective insights.
   - **Impact:** This dependency can limit the potential benefits of meta-labelling, as any biases or errors in the primary model are carried over to the secondary model, reducing overall strategy performance.

3. **Lag in Signal Generation:**
   - **Description:** Meta-labelling introduces a delay in the trading signal.
   - **Mathematical Reasoning:** The meta-label is typically generated based on the output of the primary model, which means the signal is inherently delayed by at least one period (e.g., day, hour). This delay can reduce the strategy's responsiveness to rapid market changes.
   - **Impact:** In fast-moving markets, this lag can lead to missed opportunities or increased slippage, diminishing the overall effectiveness of the strategy.

#### **Idiosyncratic Limitations and Biases**

1. **Data Snooping Bias:**
   - **Description:** The process of optimizing the meta-label based on historical performance can introduce data snooping bias.
   - **Mathematical Reasoning:** When the meta-model is trained on the same data used to develop and evaluate the primary model, there is a risk of over-optimizing based on past data, which may not be predictive of future performance.
   - **Impact:** This bias can lead to overly optimistic performance estimates and strategies that fail to perform well in out-of-sample or live trading.

2. **Difficulty in Implementation:**
   - **Description:** Implementing meta-labelling in practice can be challenging, especially in real-time trading environments.
   - **Mathematical Reasoning:** Meta-labelling requires the timely and accurate generation of primary model signals, as well as the ability to process these signals to generate meta-labels quickly enough to execute trades. This adds operational complexity and potential points of failure.
   - **Impact:** Any delays or errors in this process can lead to suboptimal trade execution, reducing the overall effectiveness of the strategy. Additionally, the added complexity may increase transaction costs and reduce net profitability.

---

<i>In summary, both PCA-based and Random Forest-based strategies have their inherent limitations and biases, stemming from their underlying mathematical assumptions and sensitivities. Meta-labelling, while potentially beneficial in enhancing strategy performance, introduces its own set of challenges, particularly related to complexity and dependency on the primary model. Understanding these limitations is crucial for developing robust trading strategies and managing risks effectively.</i>