# ⚡ XGBoost with Bayesian Optimization and Yeo-Johnson Transformation

This notebook implements an **XGBoost regression model** optimized with **Bayesian hyperparameter search (BayesSearchCV)** and enhanced through **Yeo-Johnson transformation**. It aims to predict the **next-day adjusted close price** of the Shanghai Stock Index based on both technical and macroeconomic indicators.

---

## 🧱 Workflow Summary

### 1. 📥 Data Acquisition
- Shanghai Stock Index (`000001.SS`) from Yahoo Finance
- Macroeconomic indicators:
  - **China CPI** (`CHNCPIALLMINMEI`)
  - **China Interest Rate** (`INTDSRCNM193N`)
  - Downloaded via [FRED](https://fred.stlouisfed.org)

### 2. 🔍 Feature Engineering & Preprocessing
- 3-period rolling means (the `*_Seasonal_Moving_Average_3` columns)
- First-order differencing (`diff Adj Close`)
- Lagged log prices (`t-6` to `t-3`)
- De-trending of price and macroeconomic series by subtracting the 3-period rolling mean
- Outlier removal using **Z-Score**
- **Yeo-Johnson transformation** applied to all features to stabilize variance and normalize skewness
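
A minimal sketch of the rolling-mean, differencing, and lag steps on a toy series. The prices and the short column names (`ma_3`, `detrended_3`, `diff`) are made up for illustration and are not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical toy price series standing in for 'Adj Close'
prices = pd.DataFrame(
    {"Adj Close": [100.0, 102.0, 101.0, 105.0, 107.0, 110.0, 108.0, 111.0]}
)

# 3-period rolling mean and the residual left after subtracting it
prices["ma_3"] = prices["Adj Close"].rolling(window=3).mean()
prices["detrended_3"] = prices["Adj Close"] - prices["ma_3"]

# First-order difference
prices["diff"] = prices["Adj Close"].diff()

# Log-price lags t-6 .. t-3
for lag in range(6, 2, -1):
    prices[f"t-{lag}"] = np.log(prices["Adj Close"].shift(lag))

# The longest lag (6) leaves the first six rows incomplete
features = prices.dropna().reset_index(drop=True)
print(features)
```

Because the longest lag is six rows, the first six observations are dropped before modelling.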

### 3. ⚙️ Model Training
- Model: `XGBRegressor` within a `Pipeline`
- Hyperparameter tuning with `BayesSearchCV`
- Validation with `TimeSeriesSplit(n_splits=10)`
- Features used: technical + macroeconomic + lagged features
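
To make the validation scheme concrete, the sketch below shows how `TimeSeriesSplit(n_splits=10)` partitions time-ordered rows. The 22-row array is a made-up stand-in for the training features.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 22 hypothetical time-ordered rows
X = np.arange(22).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=10)
folds = list(tscv.split(X))

for train_idx, val_idx in folds:
    # Every fold trains on an expanding prefix and validates on the rows right after it
    print(f"train 0..{train_idx[-1]}  validate {val_idx[0]}..{val_idx[-1]}")
```

Each fold validates only on rows that come after its training rows, so no future observation is used to score an earlier one.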

### 4. 🔄 Data Splitting
- 70% Training, 10% Validation, 20% Test (chronologically)
- Final model retrained on `Train + Val` before test prediction

### 5. ♻️ Inverse Transformation
- Custom inverse Yeo-Johnson applied to bring predictions back to original scale
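
For reference, this is the standard Yeo-Johnson definition, which `scipy.stats.yeojohnson` follows, together with the inverse obtained by solving each branch for $x$. Here $x$ is the original value, $y$ the transformed value, and $\lambda$ the fitted parameter.

$$
\psi(x,\lambda)=
\begin{cases}
\dfrac{(x+1)^{\lambda}-1}{\lambda}, & x\ge 0,\ \lambda\ne 0\\[6pt]
\ln(x+1), & x\ge 0,\ \lambda=0\\[6pt]
-\dfrac{(1-x)^{2-\lambda}-1}{2-\lambda}, & x<0,\ \lambda\ne 2\\[6pt]
-\ln(1-x), & x<0,\ \lambda=2
\end{cases}
$$

$$
\psi^{-1}(y,\lambda)=
\begin{cases}
(\lambda y+1)^{1/\lambda}-1, & y\ge 0,\ \lambda\ne 0\\[6pt]
e^{y}-1, & y\ge 0,\ \lambda=0\\[6pt]
1-\bigl(1-(2-\lambda)\,y\bigr)^{1/(2-\lambda)}, & y<0,\ \lambda\ne 2\\[6pt]
1-e^{-y}, & y<0,\ \lambda=2
\end{cases}
$$

Since $\psi$ is increasing in $x$ and maps $0$ to $0$, the sign of $y$ tells which branch of the inverse applies.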

### 6. 📊 Evaluation Metrics
- **RMSE**, **MAE**, **MAPE**, and **R²** calculated on:
  - Training set
  - Validation set
  - Test set
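
A small worked example of the four metrics on made-up numbers. The values are illustrative only, and MAPE is written out inline in the same form as the notebook's helper, without the small denominator offset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 129.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))        # sqrt(18 / 4) = sqrt(4.5)
mae = mean_absolute_error(y_true, y_pred)                 # (2 + 2 + 3 + 1) / 4 = 2.0
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # in percent
r2 = r2_score(y_true, y_pred)                             # 1 - 18 / 500 = 0.964

print(rmse, mae, mape, r2)
```

By construction, RMSE is never smaller than MAE and R² never exceeds 1, which makes both useful sanity checks on reported numbers.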

---

## 📎 Paper Context

This notebook corresponds to **Section 3.3** and **Table 2** of the article:  
**"The Application and Effectiveness of Machine Learning and Deep Learning Methods in Analyzing and Predicting the Shanghai Stock Index"**

---

## ✅ Notes
- All data transformations are fit **only on the training set** and reused on validation/test to prevent data leakage.
- Yeo-Johnson lambdas are stored and consistently applied across all datasets.
- Predictions are made sequentially (one-by-one) to simulate real-world time series inference.
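
The first two notes can be sketched as follows: estimate one lambda per column on the training split, keep the lambdas in a dictionary, and reuse them unchanged on later splits. The toy frames and column names (`price`, `change`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import yeojohnson

# Hypothetical toy frames standing in for the training and validation splits
rng = np.random.default_rng(0)
train_df = pd.DataFrame({"price": rng.lognormal(3.0, 0.5, 200),
                         "change": rng.normal(0.0, 1.0, 200)})
val_df = pd.DataFrame({"price": rng.lognormal(3.0, 0.5, 50),
                       "change": rng.normal(0.0, 1.0, 50)})

lambdas = {}
train_t = train_df.copy()
for col in train_df.columns:
    # With lmbda omitted, scipy estimates lambda by maximum likelihood
    train_t[col], lambdas[col] = yeojohnson(train_df[col].to_numpy())

val_t = val_df.copy()
for col in val_df.columns:
    # With lmbda given, scipy only applies the transform and returns a single array
    val_t[col] = yeojohnson(val_df[col].to_numpy(), lmbda=lambdas[col])

print(lambdas)
```

Keeping the lambdas in a dictionary built at fit time avoids copying printed values by hand into later cells.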


In [None]:
!pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-24.7.0-py3-none-any.whl (24 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-24.7.0 scikit-optimize-0.10.2


In [None]:
!pip install pmdarima

Collecting pmdarima
  Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.1 MB)
Installing collected packages: pmdarima
Successfully installed pmdarima-2.0.4


In [None]:
!pip install pandas_market_calendars

Collecting pandas_market_calendars
  Downloading pandas_market_calendars-4.4.1-py3-none-any.whl (107 kB)
Collecting exchange-calendars>=3.3 (from pandas_market_calendars)
  Downloading exchange_calendars-4.5.5-py3-none-any.whl (196 kB)
Collecting pyluach (from exchange-calendars>=3.3->pandas_market_calendars)
  Downloading pyluach-2.2.0-py3-none-any.whl (25 kB)
Collecting korean-lunar-calendar (from exchange-calendars>=3.3->pandas_market_calendars)
  Downloading korean_lunar_calendar-0.3.1-py3-none-any.whl (9.0 kB)
Installing collected packages: korean-lunar-calendar, pyluach, exchange-calendars, pandas_market_calendars
Successfully installed exchange-calendars-4.5.5 korean-lunar-calendar-0.3.1 pandas_market_calendars-4.4.1 pyluach-2.2.0


In [None]:
import warnings
import statsmodels.api as sm
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from math import sqrt
from scipy import stats
import yfinance as yf
import seaborn as sns
from statsmodels.tsa.stattools import adfuller, kpss
from sklearn.model_selection import ParameterGrid, GridSearchCV, TimeSeriesSplit
from skopt import gp_minimize
from scipy.stats import boxcox, zscore, yeojohnson
from datetime import datetime
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor, plot_importance, plot_tree
import xgboost as xgb
from skopt import BayesSearchCV
from sklearn.model_selection import learning_curve
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pmdarima import auto_arima
import pandas_market_calendars as mcal
from pandas.tseries.offsets import DateOffset
from statsmodels.tsa.arima.model import ARIMA
from scipy.special import inv_boxcox
import itertools
import pandas_datareader.data as web

warnings.filterwarnings("ignore")


In [None]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / (y_true + 1e-10))) * 100


## **Download price, CPI, and interest-rate data and build features**

In [None]:
start='2010-01-04'
end='2020-01-23'

# Download Shanghai Stock Index data
data = yf.download('000001.SS', start, end)

# Use pandas_datareader to download China's CPI data from FRED
cpi_data = web.DataReader('CHNCPIALLMINMEI', 'fred', start, end)

# Use pandas_datareader to download China's Interest Rate data from FRED
interest_rate_data = web.DataReader('INTDSRCNM193N', 'fred', start, end)

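# Merge all dataframes on the date index (outer join; NaNs are kept at this point)
data = pd.concat([data, cpi_data, interest_rate_data], axis=1)

data = data.reset_index()

# Keep only rows where every column is present. The FRED series are monthly,
# so this retains only trading days that coincide with a FRED observation date.
data = data.dropna()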

def make_stationary_Seasonal_Moving_Average_3(data, column):
    # 3-period rolling mean (over the current row and the two preceding rows)
    data[f'{column}_Seasonal_Moving_Average_3'] = data[column].rolling(window=3).mean()

    # Subtract the rolling mean to remove the local trend
    data[f'{column}_Stationary_Seasonal_Moving_Average_3'] = data[column] - data[f'{column}_Seasonal_Moving_Average_3']

    return data

# Create a new column named 'diff Adj Close' that is the first difference of 'Adj Close'
data['diff Adj Close'] = data['Adj Close'].diff()

# Create a stationary series from each non-stationary series
non_stationary_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'CHNCPIALLMINMEI', 'INTDSRCNM193N']
for column in non_stationary_columns:
    data = make_stationary_Seasonal_Moving_Average_3(data, column)

# Lag features: log of 'Adj Close' shifted by 6, 5, 4, and 3 rows
for i in range(6, 2, -1):
    data['t-'+str(i)] = np.log(data["Adj Close"].shift(i))

data = data.dropna()
data = data.reset_index()

data


[*********************100%%**********************]  1 of 1 completed


Unnamed: 0,level_0,index,Open,High,Low,Close,Adj Close,Volume,CHNCPIALLMINMEI,INTDSRCNM193N,...,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3,INTDSRCNM193N_Seasonal_Moving_Average_3,INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
0,200,2010-11-01,2986.885986,3054.050049,2986.885986,3054.020996,3054.020996,186200.0,88.84555,2.79,...,2683.565023,370.455973,87.448167,1.397383,2.79,0.0,7.986627,8.035228,8.054337,7.850993
1,222,2010-12-01,2810.542969,2834.198975,2795.811035,2823.448975,2823.448975,81200.0,89.28943,3.25,...,2833.450684,-10.001709,88.467437,0.8219933,2.943333,0.306667,8.035228,8.054337,7.850993,7.772244
2,265,2011-02-01,2795.071045,2805.049072,2785.295898,2798.959961,2798.959961,73400.0,91.33717,3.25,...,2892.143311,-93.18335,89.82405,1.51312,3.096667,0.153333,8.054337,7.850993,7.772244,7.872029
3,280,2011-03-01,2906.284912,2931.580078,2901.552002,2918.919922,2918.919922,136600.0,91.15383,3.25,...,2847.109619,71.810303,90.593477,0.5603533,3.25,0.0,7.850993,7.772244,7.872029,8.024214
4,303,2011-04-01,2932.480957,2967.846924,2924.394043,2967.409912,2967.409912,99800.0,91.2455,3.25,...,2895.096598,72.313314,91.2455,1.421085e-14,3.25,0.0,7.772244,7.872029,8.024214,7.945714
5,344,2011-06-01,2737.056885,2744.559082,2726.495117,2743.572021,2743.572021,68600.0,91.61217,3.25,...,2876.633952,-133.06193,91.337167,0.2750033,3.25,0.0,7.872029,8.024214,7.945714,7.937003
6,365,2011-07-01,2767.833008,2778.667969,2752.966064,2759.362061,2759.362061,92000.0,92.01405,3.25,...,2823.447998,-64.085938,91.623907,0.3901433,3.25,0.0,8.024214,7.945714,7.937003,7.978969
7,386,2011-08-01,2697.574951,2712.887939,2688.529053,2703.782959,2703.782959,59000.0,92.27545,3.25,...,2735.572347,-31.789388,91.967223,0.3082267,3.25,0.0,7.945714,7.937003,7.978969,7.995445
8,409,2011-09-01,2569.799072,2584.803955,2547.854004,2556.041992,2556.041992,57200.0,92.71113,3.25,...,2673.062337,-117.020345,92.333543,0.3775867,3.25,0.0,7.937003,7.978969,7.995445,7.917016
9,447,2011-11-01,2450.331055,2491.35498,2445.5271,2470.019043,2470.019043,88800.0,92.62399,3.25,...,2576.614665,-106.595622,92.536857,0.08713333,3.25,0.0,7.978969,7.995445,7.917016,7.922755


In [None]:
data.columns

Index(['level_0', 'index', 'Open', 'High', 'Low', 'Close', 'Adj Close',
       'Volume', 'CHNCPIALLMINMEI', 'INTDSRCNM193N', 'diff Adj Close',
       'Open_Seasonal_Moving_Average_3',
       'Open_Stationary_Seasonal_Moving_Average_3',
       'High_Seasonal_Moving_Average_3',
       'High_Stationary_Seasonal_Moving_Average_3',
       'Low_Seasonal_Moving_Average_3',
       'Low_Stationary_Seasonal_Moving_Average_3',
       'Close_Seasonal_Moving_Average_3',
       'Close_Stationary_Seasonal_Moving_Average_3',
       'Adj Close_Seasonal_Moving_Average_3',
       'Adj Close_Stationary_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3', 't-6', 't-5',
       't-4', 't-3'],
      dtype='object')

In [None]:
# Determine the length of the training data (70%)
train_len = int(len(data["Adj Close"]) * 0.7)

# Determine the length of the validation data (10%)
val_len = int(len(data["Adj Close"]) * 0.1)

# Set the training, validation, and test data
train_data = data.iloc[:train_len]
val_data = data.iloc[train_len:train_len + val_len]
test_data = data.iloc[train_len + val_len:]
# Selecting columns
columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
       'CHNCPIALLMINMEI', 'INTDSRCNM193N', 'diff Adj Close',
       'Open_Seasonal_Moving_Average_3',
       'Open_Stationary_Seasonal_Moving_Average_3',
       'High_Seasonal_Moving_Average_3',
       'High_Stationary_Seasonal_Moving_Average_3',
       'Low_Seasonal_Moving_Average_3',
       'Low_Stationary_Seasonal_Moving_Average_3',
       'Close_Seasonal_Moving_Average_3',
       'Close_Stationary_Seasonal_Moving_Average_3',
       'Adj Close_Seasonal_Moving_Average_3',
       'Adj Close_Stationary_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3', 't-6', 't-5',
       't-4', 't-3']

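# Calculating Z-Score for each column (statistics come from the training split only)
z_scores = zscore(train_data[columns])

# Creating a training dataframe without outliers.
# Note: `z_scores < 3` filters only the upper tail; `np.abs(z_scores) < 3` would filter both tails.
train_data_without_outliers = train_data[(z_scores < 3).all(axis=1)]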
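
A short sketch of the difference between a one-sided and a two-sided z-score filter. The toy values are made up so that one point sits far below the mean and one far above it.

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical toy column: 20 ordinary values plus one extreme low and one extreme high
values = np.array([10.0] * 10 + [11.0] * 10 + [-50.0, 70.0])

z = zscore(values)

one_sided = z < 3          # keeps the extreme low value
two_sided = np.abs(z) < 3  # drops both extremes

print(one_sided.sum(), two_sided.sum())
```

Which variant is appropriate depends on whether unusually low values should also count as outliers.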


# ***1) Scaling training data using the Yeo-Johnson transformation***

In [None]:
# Creating a copy of the training data without outliers
train_data_yeojohnson = train_data_without_outliers.copy()

# Applying the Yeo-Johnson transformation on the desired columns
columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
       'CHNCPIALLMINMEI', 'INTDSRCNM193N', 'diff Adj Close',
       'Open_Seasonal_Moving_Average_3',
       'Open_Stationary_Seasonal_Moving_Average_3',
       'High_Seasonal_Moving_Average_3',
       'High_Stationary_Seasonal_Moving_Average_3',
       'Low_Seasonal_Moving_Average_3',
       'Low_Stationary_Seasonal_Moving_Average_3',
       'Close_Seasonal_Moving_Average_3',
       'Close_Stationary_Seasonal_Moving_Average_3',
       'Adj Close_Seasonal_Moving_Average_3',
       'Adj Close_Stationary_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Seasonal_Moving_Average_3',
       'CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Seasonal_Moving_Average_3',
       'INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3', 't-6', 't-5',
       't-4', 't-3']

for column in columns:
    # Estimate lambda on the training data and apply the Yeo-Johnson transformation
    # (`yeojohnson` is imported from scipy.stats above)
    train_data_yeojohnson[column], lam = yeojohnson(train_data_yeojohnson[column])

    # Print the lambda so it can be reused for the validation and test sets
    print(f'Lambda for {column}: {lam}')

train_data_yeojohnson


Lambda for Open: 0.315479755906296
Lambda for High: 0.2516480595154726
Lambda for Low: 0.37897651029596785
Lambda for Close: 0.06325975217652516
Lambda for Adj Close: 0.06325975217652516
Lambda for Volume: -0.8039347933036654
Lambda for CHNCPIALLMINMEI: -2.0283925889520793
Lambda for INTDSRCNM193N: 54.82045691074937
Lambda for diff Adj Close: 1.0339521559688147
Lambda for Open_Seasonal_Moving_Average_3: -1.5263975038020332
Lambda for Open_Stationary_Seasonal_Moving_Average_3: 1.0601518479305023
Lambda for High_Seasonal_Moving_Average_3: -1.6181686793783123
Lambda for High_Stationary_Seasonal_Moving_Average_3: 1.0617945030201263
Lambda for Low_Seasonal_Moving_Average_3: -1.4679973089920006
Lambda for Low_Stationary_Seasonal_Moving_Average_3: 1.0600526174468958
Lambda for Close_Seasonal_Moving_Average_3: -1.6275930602347688
Lambda for Close_Stationary_Seasonal_Moving_Average_3: 1.0518208509131413
Lambda for Adj Close_Seasonal_Moving_Average_3: -1.6275930602347688
Lambda for Adj Close_Sta

Unnamed: 0,level_0,index,Open,High,Low,Close,Adj Close,Volume,CHNCPIALLMINMEI,INTDSRCNM193N,...,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3,INTDSRCNM193N_Seasonal_Moving_Average_3,INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
0,200,2010-11-01,36.406553,25.962937,52.12202,10.454514,10.454514,1.24381,0.492947,9.601149e+29,...,0.614403,478.937783,2.381706,1.085146,2.96558e+26,-0.0,0.11767,2.507359,36.78252,0.209482
2,265,2011-02-01,35.586737,25.329251,50.691378,10.310074,10.310074,1.24373,0.49295,5.124195e+32,...,0.614403,-77.431079,2.388004,1.157528,1.307856e+28,0.004602,0.11767,2.480802,34.55646,0.209482
3,280,2011-03-01,36.066586,25.656296,51.523975,10.379479,10.379479,1.243789,0.49295,5.124195e+32,...,0.614403,85.496577,2.389997,0.4956913,7.819276e+28,-0.0,0.11767,2.469302,35.336281,0.209482
4,303,2011-04-01,36.177778,25.748085,51.685117,10.406777,10.406777,1.243763,0.49295,5.124195e+32,...,0.614403,86.12486,2.391669,1.441893e-14,7.819276e+28,-0.0,0.11767,2.483859,36.541643,0.209482
5,344,2011-06-01,35.331225,25.16899,50.26203,10.277084,10.277084,1.243721,0.492951,5.124195e+32,...,0.614403,-108.637684,2.391903,0.2575982,7.819276e+28,-0.0,0.11767,2.505785,35.917479,0.209482
6,365,2011-07-01,35.467229,25.259678,50.456018,10.286552,10.286552,1.243755,0.492951,5.124195e+32,...,0.614403,-54.232002,2.392632,0.3567199,7.819276e+28,-0.0,0.11767,2.494516,35.848532,0.209482
7,386,2011-08-01,35.155208,25.084025,49.98175,10.252998,10.252998,1.2437,0.492951,5.124195e+32,...,0.614403,-27.805347,2.393501,0.2866634,7.819276e+28,-0.0,0.11767,2.493261,36.181261,0.209482
8,409,2011-09-01,34.573185,24.732645,48.921258,10.160559,10.160559,1.243696,0.492952,5.124195e+32,...,0.614403,-96.151982,2.394424,0.346124,7.819276e+28,-0.0,0.11767,2.499301,36.312294,0.209482
9,447,2011-11-01,34.010809,24.467972,48.126803,10.104404,10.104404,1.243751,0.492952,5.124195e+32,...,0.614402,-87.991441,2.394934,0.08523262,7.819276e+28,-0.0,0.11767,2.501665,35.690579,0.209482
10,469,2011-12-01,33.731748,24.271268,47.58249,10.04835,10.04835,1.243762,0.492952,5.124195e+32,...,0.614402,-70.246147,2.395443,0.1400817,7.819276e+28,-0.0,0.11767,2.490375,35.735896,0.209482


# ***2) Scaling validation data using the Yeo-Johnson transformation***

In [None]:
# Creating a copy of the validation data
val_data_yeojohnson = val_data.copy()

# Storing the lambda parameters for each column in a dictionary
lambdas = {'Open': 0.315479755906296, 'High': 0.2516480595154726, 'Low': 0.37897651029596785,
           'Close': 0.06325975217652516, 'Adj Close': 0.06325975217652516, 'Volume': -0.8039347933036654,
           'CHNCPIALLMINMEI': -2.0283925889520793, 'INTDSRCNM193N': 54.82045691074937, 'diff Adj Close': 1.0339521559688147,
           'Open_Seasonal_Moving_Average_3': -1.5263975038020332, 'Open_Stationary_Seasonal_Moving_Average_3': 1.0601518479305023,
           'High_Seasonal_Moving_Average_3': -1.6181686793783123, 'High_Stationary_Seasonal_Moving_Average_3': 1.0617945030201263,
           'Low_Seasonal_Moving_Average_3': -1.4679973089920006, 'Low_Stationary_Seasonal_Moving_Average_3': 1.0600526174468958,
           'Close_Seasonal_Moving_Average_3': -1.6275930602347688, 'Close_Stationary_Seasonal_Moving_Average_3': 1.0518208509131413,
           'Adj Close_Seasonal_Moving_Average_3': -1.6275930602347688, 'Adj Close_Stationary_Seasonal_Moving_Average_3': 1.0518208509131413,
           'CHNCPIALLMINMEI_Seasonal_Moving_Average_3': -0.3196873271615117, 'CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3': 0.4773852839294433,
           'INTDSRCNM193N_Seasonal_Moving_Average_3': 48.66473521447997, 'INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3': -217.30550930174238,
           't-6': -8.498324117358287, 't-5': 0.1158936496368651, 't-4': 1.9443257455185823, 't-3': -4.773541547848075}

for column in columns:
    # Apply Yeo-Johnson transformation using the lambda parameter from the training data
    val_data_yeojohnson[column] = yeojohnson(val_data_yeojohnson[column], lmbda=lambdas[column])

val_data_yeojohnson


Unnamed: 0,level_0,index,Open,High,Low,Close,Adj Close,Volume,CHNCPIALLMINMEI,INTDSRCNM193N,...,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3,INTDSRCNM193N_Seasonal_Moving_Average_3,INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
40,1712,2016-12-01,37.502137,26.504556,53.943023,10.569933,10.569933,1.243823,0.492961,4.607753e+30,...,0.614403,146.918507,2.418203,0.128978,1.193406e+27,-0.0,0.11767,2.499034,36.218089,0.209482
41,1772,2017-03-01,37.435247,26.458413,53.821779,10.556441,10.556441,1.243811,0.492961,4.607753e+30,...,0.614403,37.50449,2.418787,0.420239,1.193406e+27,-0.0,0.11767,2.499966,36.274593,0.209482
42,1835,2017-06-01,36.907499,26.108531,52.88287,10.48075,10.48075,1.243802,0.492961,4.607753e+30,...,0.614403,-86.738086,2.419297,0.128978,1.193406e+27,-0.0,0.11767,2.500986,36.565881,0.209482
43,1879,2017-08-01,37.57031,26.534827,54.056383,10.579756,10.579756,1.243823,0.492962,4.607753e+30,...,0.614403,93.961628,2.419949,0.220578,1.193406e+27,-0.0,0.11767,2.506219,36.718728,0.209483
44,1902,2017-09-01,37.92645,26.740876,54.609482,10.61711,10.61711,1.24383,0.492962,4.607753e+30,...,0.614403,137.572512,2.420599,0.474179,1.193406e+27,-0.0,0.11767,2.508949,37.097277,0.209482


# ***3) Scaling test data using the Yeo-Johnson transformation***

In [None]:
# Creating a copy of the test data
test_data_yeojohnson = test_data.copy()

# Storing the lambda parameters for each column in a dictionary
lambdas = {'Open': 0.315479755906296, 'High': 0.2516480595154726, 'Low': 0.37897651029596785,
           'Close': 0.06325975217652516, 'Adj Close': 0.06325975217652516, 'Volume': -0.8039347933036654,
           'CHNCPIALLMINMEI': -2.0283925889520793, 'INTDSRCNM193N': 54.82045691074937, 'diff Adj Close': 1.0339521559688147,
           'Open_Seasonal_Moving_Average_3': -1.5263975038020332, 'Open_Stationary_Seasonal_Moving_Average_3': 1.0601518479305023,
           'High_Seasonal_Moving_Average_3': -1.6181686793783123, 'High_Stationary_Seasonal_Moving_Average_3': 1.0617945030201263,
           'Low_Seasonal_Moving_Average_3': -1.4679973089920006, 'Low_Stationary_Seasonal_Moving_Average_3': 1.0600526174468958,
           'Close_Seasonal_Moving_Average_3': -1.6275930602347688, 'Close_Stationary_Seasonal_Moving_Average_3': 1.0518208509131413,
           'Adj Close_Seasonal_Moving_Average_3': -1.6275930602347688, 'Adj Close_Stationary_Seasonal_Moving_Average_3': 1.0518208509131413,
           'CHNCPIALLMINMEI_Seasonal_Moving_Average_3': -0.3196873271615117, 'CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3': 0.4773852839294433,
           'INTDSRCNM193N_Seasonal_Moving_Average_3': 48.66473521447997, 'INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3': -217.30550930174238,
           't-6': -8.498324117358287, 't-5': 0.1158936496368651, 't-4': 1.9443257455185823, 't-3': -4.773541547848075}

for column in columns:
    # Apply Yeo-Johnson transformation using the lambda parameter from the training data
    test_data_yeojohnson[column] = yeojohnson(test_data_yeojohnson[column], lmbda=lambdas[column])

test_data_yeojohnson


Unnamed: 0,level_0,index,Open,High,Low,Close,Adj Close,Volume,CHNCPIALLMINMEI,INTDSRCNM193N,...,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Seasonal_Moving_Average_3,CHNCPIALLMINMEI_Stationary_Seasonal_Moving_Average_3,INTDSRCNM193N_Seasonal_Moving_Average_3,INTDSRCNM193N_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
45,1941,2017-11-01,38.033874,26.805623,54.803498,10.631344,10.631344,1.243808,0.492962,4.607753e+30,...,0.614403,51.189885,2.421391,0.250243,1.193406e+27,-0.0,0.11767,2.515667,37.032244,0.209482
46,1963,2017-12-01,37.729485,26.608867,54.245732,10.592371,10.592371,1.243791,0.492962,4.607753e+30,...,0.614403,-36.757422,2.422107,0.220578,1.193406e+27,-0.0,0.11767,2.514517,36.667804,0.209483
47,2007,2018-02-01,38.355456,26.996267,55.032804,10.656313,10.656313,1.243827,0.492964,4.607753e+30,...,0.614403,70.990405,2.423813,1.107821,1.193406e+27,-0.0,0.11767,2.508041,37.144639,0.209483
48,2022,2018-03-01,37.415538,26.505674,53.760376,10.57016,10.57016,1.2438,0.492963,4.607753e+30,...,0.614403,-60.876993,2.42459,-0.137885,1.193406e+27,-0.0,0.11767,2.516503,37.324842,0.209483
49,2086,2018-06-01,36.811013,26.080704,52.624576,10.465961,10.465961,1.243786,0.492963,4.607753e+30,...,0.614403,-152.502127,2.424871,-0.772788,1.193406e+27,-0.0,0.11767,2.519674,37.393551,0.209483
50,2129,2018-08-01,35.965059,25.569008,50.970412,10.325101,10.325101,1.243796,0.492963,4.607753e+30,...,0.614403,-185.168523,2.424449,0.474179,1.193406e+27,-0.0,0.11767,2.52088,37.205478,0.209483
51,2191,2018-11-01,34.790596,24.876823,49.346134,10.192514,10.192514,1.243814,0.492964,4.607753e+30,...,0.614403,-181.994662,2.425222,0.578498,1.193406e+27,-0.0,0.11767,2.517575,37.514134,0.209483
52,2257,2019-02-01,34.702296,24.826262,49.246916,10.200065,10.200065,1.243787,0.492965,4.607753e+30,...,0.614403,-54.782041,2.427453,1.022814,1.193406e+27,-0.0,0.11767,2.522991,37.098372,0.209482
53,2272,2019-03-01,36.270303,25.813768,51.730414,10.421573,10.421573,1.243838,0.492965,4.607753e+30,...,0.614403,322.811713,2.428695,0.279503,1.193406e+27,-0.0,0.11767,2.515686,36.596675,0.209482
54,2293,2019-04-01,36.920674,26.260756,52.977689,10.51668,10.51668,1.243847,0.492965,4.607753e+30,...,0.614403,307.25687,2.429655,-0.067816,1.193406e+27,-0.0,0.11767,2.50677,35.920521,0.209482


In [None]:
# Select the desired features
features = ['Open_Seasonal_Moving_Average_3', 'High_Seasonal_Moving_Average_3',
            'Low_Seasonal_Moving_Average_3', 't-6', 't-5', 't-4', 't-3', 'Volume', 'High', 'Low',
            'Close_Seasonal_Moving_Average_3', 'CHNCPIALLMINMEI', 'INTDSRCNM193N']

# Create the target 'Adj Close (t+1)': shift(-1) gives each row the next row's 'Adj Close' (computed within each split)
train_data_yeojohnson['Adj Close (t+1)'] = train_data_yeojohnson['Adj Close'].shift(-1)
val_data_yeojohnson['Adj Close (t+1)'] = val_data_yeojohnson['Adj Close'].shift(-1)
test_data_yeojohnson['Adj Close (t+1)'] = test_data_yeojohnson['Adj Close'].shift(-1)

# Drop rows with null values
train_data_yeojohnson = train_data_yeojohnson.dropna()
val_data_yeojohnson = val_data_yeojohnson.dropna()
test_data_yeojohnson = test_data_yeojohnson.dropna()

# Set 'Adj Close (t+1)' as the target variable (y)
y_train = train_data_yeojohnson['Adj Close (t+1)']
y_val = val_data_yeojohnson['Adj Close (t+1)']
y_test = test_data_yeojohnson['Adj Close (t+1)']

# Drop 'Adj Close (t+1)' from the feature set
X_train = train_data_yeojohnson.drop(['Adj Close (t+1)'], axis=1)
X_val = val_data_yeojohnson.drop(['Adj Close (t+1)'], axis=1)
X_test = test_data_yeojohnson.drop(['Adj Close (t+1)'], axis=1)

# Select the desired features
X_train = X_train[features]
X_val = X_val[features]
X_test = X_test[features]

# Define the model as a pipeline
pipe = Pipeline([
    ('xgb', XGBRegressor(random_state=42))
])

# Define the hyperparameters to tune
param_space = {
    'xgb__n_estimators': (10, 800),
    'xgb__max_depth': (1, 200),
    'xgb__learning_rate': (0.01, 0.3)
}

# Initialize TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=10)

# Initialize BayesSearchCV
opt = BayesSearchCV(
    pipe,
    param_space,
    cv=tscv,
    n_iter=100,  # You can adjust this number
    random_state=0,
)

# Fit BayesSearchCV on the training data
opt.fit(X_train, y_train)

# Get the best model
best_model = opt.best_estimator_

# Predict one by one on validation data
predictions_val = []
for i in range(len(X_val)):
    current_pred = best_model.predict(X_val.iloc[i:i+1])
    predictions_val.append(current_pred[0])

# Combine training and validation data
train_data_yeojohnson = pd.concat([train_data_yeojohnson, val_data_yeojohnson]).reset_index(drop=True)
y_train = train_data_yeojohnson['Adj Close (t+1)']
X_train = train_data_yeojohnson[features]

# Refit on the combined training and validation data
# (note: opt.fit runs the hyperparameter search again on the combined data)
opt.fit(X_train, y_train)

# Get updated best model
best_model = opt.best_estimator_

# Predict one by one on test data
predictions_test = []
for i in range(len(X_test)):
    current_pred = best_model.predict(X_test.iloc[i:i+1])
    predictions_test.append(current_pred[0])


In [None]:
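# Custom inverse Yeo-Johnson transformation (inverse of scipy.stats.yeojohnson)
def inv_yeojohnson(y, lmbda):
    y = np.asarray(y, dtype=float)
    x = np.empty_like(y)
    pos = y >= 0  # transformed values >= 0 come from original values >= 0

    # Branch for original values >= 0
    if lmbda == 0:
        x[pos] = np.exp(y[pos]) - 1
    else:
        x[pos] = (lmbda * y[pos] + 1) ** (1 / lmbda) - 1

    # Branch for original values < 0
    if lmbda == 2:
        x[~pos] = 1 - np.exp(-y[~pos])
    else:
        x[~pos] = 1 - (1 - (2 - lmbda) * y[~pos]) ** (1 / (2 - lmbda))
    return x

# Predict on training data (at this point X_train holds the train + validation rows)
predictions_train = best_model.predict(X_train)

# The target is a shifted copy of the transformed 'Adj Close', so it uses that column's lambda
target_lambda = lambdas['Adj Close']

# Inverse transform actual and predicted values to original scale
y_train_inv = inv_yeojohnson(y_train, target_lambda)
y_val_inv = inv_yeojohnson(y_val, target_lambda)
y_test_inv = inv_yeojohnson(y_test, target_lambda)

predictions_train_inv = inv_yeojohnson(predictions_train, target_lambda)
predictions_val_inv = inv_yeojohnson(predictions_val, target_lambda)
predictions_test_inv = inv_yeojohnson(predictions_test, target_lambda)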

# Define a function to calculate and print evaluation metrics
def calculate_metrics(y_true, y_pred, label="Set"):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{label} RMSE: {rmse}")
    print(f"{label} MAE: {mae}")
    print(f"{label} MAPE: {mape}")
    print(f"{label} R^2: {r2}\n")

# Calculate metrics for training, validation, and test sets
calculate_metrics(y_train_inv, predictions_train_inv, label="Training")
calculate_metrics(y_val_inv, predictions_val_inv, label="Validation")
calculate_metrics(y_test_inv, predictions_test_inv, label="Test")


Training RMSE: 0.0006334271449593486
Training MAE: 0.1193489942385699
Training MAPE: 0.004419465429856939
Training R^2: 0.42441916522985396

Validation RMSE: 0.21062055621322834
Validation MAE: 0.19876179193797494
Validation MAPE: 1.8142466824163435
Validation R^2: 0.32543402001964

Test RMSE: 0.1265133591270997
Test MAE: 0.09885935189654287
Test MAPE: 0.9065590967817378
Test R^2: 1671379942385699



In [1]:
print("Best parameters found for XGBoost:")
print(opt.best_params_)

Best parameters found for XGBoost:
OrderedDict([('xgb__learning_rate', 0.12548099768823331), ('xgb__max_depth', 1), ('xgb__n_estimators', 47)])

