# 🌲 Random Forest with Bayesian Hyperparameter Optimization

This notebook implements a **Random Forest regression model** with **Bayesian Optimization (BayesSearchCV)** for predicting the next-day adjusted close price of the **Shanghai Stock Index**. The model is trained on features engineered from historical price, technical indicators, and macroeconomic variables.

---

## 🧱 Workflow Overview

### 1. 📥 Data Collection
- Shanghai Stock Index (`000001.SS`) from Yahoo Finance
- China's Consumer Price Index (CPI, series `CHNCPIALLMINMEI`) and Interest Rate (series `INTDSRCNM193N`) from FRED

### 2. 🧹 Preprocessing
- Handling missing values
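- Detrending with a 3-period **Seasonal Moving Average** (a 3-row rolling mean subtracted from each price column)
- First-order differencing (`diff Adj Close`)
- Lag features: `t-6`, `t-5`, `t-4`, `t-3` (log of `Adj Close` shifted by 6 to 3 rows)
- Outlier removal via the Z-Score method (threshold 3, applied to the training set only)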
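
The detrending, differencing, and lag steps above can be sketched on a toy series. This is a minimal illustration only: the column name `price` and all values are hypothetical and stand in for `Adj Close`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series standing in for 'Adj Close' (synthetic values).
df = pd.DataFrame({"price": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0, 111.0]})

# 3-period moving average and the detrended residual (price minus its rolling mean).
df["ma_3"] = df["price"].rolling(window=3).mean()
df["detrended"] = df["price"] - df["ma_3"]

# First-order difference of the price.
df["diff"] = df["price"].diff()

# Lag features t-6 ... t-3: log of the price shifted by i rows.
for i in range(6, 2, -1):
    df[f"t-{i}"] = np.log(df["price"].shift(i))

# Rows without a full history contain NaN and are dropped.
clean = df.dropna()
print(clean.shape)
```

Because the longest lag is 6 rows, the first six rows are lost once `dropna()` is applied.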

### 3. 📐 Feature Scaling
- Min-Max scaler is **fit only on training data**, then applied unchanged to the validation and test sets
- Ensures **no data leakage** into validation or test sets
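
A minimal sketch of leakage-free scaling, assuming synthetic values and a simple chronological split (neither is taken from the notebook's data). The scaler learns its minimum and maximum from the training split only, so later splits can legitimately fall outside `[0, 1]`, as the scaled validation and test tables further down do.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic values in chronological order (hypothetical, for illustration only).
values = np.array([[10.0], [12.0], [14.0], [20.0], [18.0], [25.0]])
train, val, test = values[:4], values[4:5], values[5:]

# Fit on the training split only, so min/max come from training data alone.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)

# Reuse the fitted scaler; values above the training maximum map above 1.
val_scaled = scaler.transform(val)
test_scaled = scaler.transform(test)

print(train_scaled.ravel(), val_scaled.ravel(), test_scaled.ravel())
```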

### 4. 🤖 Model Training
- Model: `RandomForestRegressor` wrapped in a pipeline
- Hyperparameter tuning via `BayesSearchCV` using `TimeSeriesSplit(n_splits=10)`
- Features used:
  - 3-period moving averages of Open, High, and Low
  - Lag features
  - Volume, CPI, Interest Rate
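
To illustrate how the cross-validation splitter passed to `BayesSearchCV` as `cv` behaves, here is a small sketch on 22 synthetic, time-ordered samples (the sample count is an assumption chosen to keep the output short). It shows only the splitter, not the Bayesian search itself: each fold trains on an expanding prefix and validates on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 22 synthetic, time-ordered samples (hypothetical size).
X = np.arange(22).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=10)
folds = list(tscv.split(X))

# Print the first three folds: training always ends before validation begins.
for train_idx, test_idx in folds[:3]:
    print(f"train up to index {train_idx[-1]}, validate on {test_idx.tolist()}")
```

No fold ever validates on rows that precede its training rows, which is the property that makes this splitter suitable for time series.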

### 5. 📊 Evaluation
- Metrics: **RMSE**, **MAE**, **MAPE**, **R²**
- Performance reported on **Training**, **Validation**, and **Test** sets
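
The four metrics can be computed as in the following sketch. The arrays are synthetic placeholders, not results from this notebook, and MAPE is written out by hand here for transparency.

```python
import math

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic targets and predictions (hypothetical values).
y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([102.0, 108.0, 123.0, 129.0])

rmse = math.sqrt(mean_squared_error(y_true, y_pred))       # root mean squared error
mae = mean_absolute_error(y_true, y_pred)                  # mean absolute error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # mean absolute percentage error
r2 = r2_score(y_true, y_pred)                              # coefficient of determination

print(rmse, mae, mape, r2)
```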

---

## 🔬 Paper Context

This notebook corresponds to **Section 3.2** of the paper:  
*"The Application and Effectiveness of Machine Learning and Deep Learning Methods in Analyzing and Predicting the Shanghai Stock Index"*

It also supports the findings shown in **Table 1** and **Table 2**, where the Random Forest model's performance is compared with other statistical and deep learning approaches.

---

## ✅ Key Notes
- Hyperparameters are tuned by time-series cross-validation on the training set; the validation set serves as a hold-out check, after which the search is refit on `train + val`
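- Predictions are passed through `scaler_adj_close.inverse_transform` before evaluation. Because that scaler is fitted on the already min-max-scaled `Adj Close` column, the reported metrics are in scaled units rather than index points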
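
A minimal sketch of the inverse-transform step on synthetic prices (all values and variable names are hypothetical). It also illustrates why it matters which scaler is inverted: a scaler fitted on raw prices maps predictions back to price units, whereas a scaler fitted on data that is already in `[0, 1]` acts as an identity mapping.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic raw prices (hypothetical).
prices = np.array([[2000.0], [2500.0], [3000.0]])

# Scaler fitted on raw prices: inverse_transform recovers price units.
price_scaler = MinMaxScaler().fit(prices)
scaled = price_scaler.transform(prices)
pred_scaled = np.array([[0.75]])
pred_price = price_scaler.inverse_transform(pred_scaled)

# Scaler fitted on already scaled values (min 0, max 1): an identity mapping.
second_scaler = MinMaxScaler().fit(scaled)
pred_still_scaled = second_scaler.inverse_transform(pred_scaled)

print(pred_price[0, 0], pred_still_scaled[0, 0])
```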


In [None]:
!pip install scikit-optimize

Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-24.7.0-py3-none-any.whl (24 kB)
Installing collected packages: pyaml, scikit-optimize
Successfully installed pyaml-24.7.0 scikit-optimize-0.10.2


In [None]:
!pip install --upgrade pandas


Collecting pandas
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pandas<2.2.2dev0,>=2.0, but you have pandas 2.2.2 which is incompatible.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.2 which is incompatible.
Successfully installed pandas-2.2.2


In [None]:
# Standard library
import math
import warnings
from datetime import datetime
from math import sqrt

# Third-party: data handling, plotting, and data sources
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader.data as web
import scipy.stats
import seaborn as sns
import statsmodels.api as sm
import yfinance as yf
from scipy import stats
from scipy.stats import boxcox, yeojohnson, zscore
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss

# Third-party: modelling and hyperparameter search
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, ParameterGrid, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from skopt import BayesSearchCV, gp_minimize

# Silence library warnings for cleaner notebook output
warnings.filterwarnings("ignore")


In [None]:
# Define the time range for data retrieval
start = datetime(2010, 1, 4)
end = datetime(2020, 1, 23)

# Download Shanghai Stock Index data
shanghai_data = yf.download('000001.SS', start, end)

# Use pandas_datareader to download China's CPI data from FRED
cpi_data = web.DataReader('CHNCPIALLMINMEI', 'fred', start, end)

# Use pandas_datareader to download China's Interest Rate data from FRED
interest_rate_data = web.DataReader('INTDSRCNM193N', 'fred', start, end)

# Merge all dataframes into one without dropping NaN values
data = pd.concat([shanghai_data, cpi_data, interest_rate_data], axis=1)

# Rename columns for clarity
column_names = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
if 'CHNCPIALLMINMEI' in data.columns:
    column_names.append('China CPI')
if 'INTDSRCNM193N' in data.columns:
    column_names.append('China Interest Rate')

data.columns = column_names

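# Drop the date index, then keep only rows where every column is present
# (the FRED series are not daily, so most trading days have NaN and are dropped)
data = data.reset_index(drop=True).dropna()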

data




Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,China CPI,China Interest Rate
20,2981.374023,2981.374023,2912.886963,2941.360107,2941.360107,88200.0,86.92914,2.79
35,3057.006104,3093.094971,3054.553955,3087.842041,3087.842041,111000.0,86.31617,2.79
58,3111.935059,3148.343994,3111.935059,3147.416016,3147.416016,126800.0,86.49231,2.79
100,2577.762939,2598.896973,2534.267090,2568.282959,2568.282959,74600.0,85.88638,2.79
119,2393.947998,2410.769043,2371.778076,2373.791992,2373.791992,50000.0,86.23162,2.79
...,...,...,...,...,...,...,...,...
2272,2954.402100,2994.004883,2930.834961,2994.004883,2994.004883,345800.0,107.70000,2.90
2293,3111.662109,3176.622070,3111.662109,3170.361084,3170.361084,466100.0,107.80000,2.90
2353,3024.624023,3045.366943,3014.687012,3044.903076,3044.903076,243300.0,108.10000,2.90
2376,2920.850098,2927.340088,2901.750000,2908.770020,2908.770020,138100.0,108.90000,2.90


In [None]:
def make_stationary_Seasonal_Moving_Average_3(data, column):
    # Calculate the seasonal moving average
    data[f'{column}_Seasonal_Moving_Average_3'] = data[column].rolling(window=3).mean()

    # Remove the seasonal trend from the data
    data[f'{column}_Stationary_Seasonal_Moving_Average_3'] = data[column] - data[f'{column}_Seasonal_Moving_Average_3']

    return data

# Create a new column named 'diff Adj Close' that is the first difference of 'Adj Close'
# (the first row is NaN; it is removed by the later data.dropna() call)
data['diff Adj Close'] = data['Adj Close'].diff()



# Create a stationary series from each non-stationary series
non_stationary_columns = ['Open', 'High', 'Low', 'Close', 'Adj Close']
for column in non_stationary_columns:
    data = make_stationary_Seasonal_Moving_Average_3(data, column)


In [None]:
# Lag features t-6 ... t-3: log of 'Adj Close' shifted by 6, 5, 4, and 3 rows
for i in range(6, 2, -1):
    data['t-'+str(i)] = np.log(data["Adj Close"].shift(i))


data = data.dropna()

data

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,China CPI,China Interest Rate,diff Adj Close,Open_Seasonal_Moving_Average_3,...,Low_Seasonal_Moving_Average_3,Low_Stationary_Seasonal_Moving_Average_3,Close_Seasonal_Moving_Average_3,Close_Stationary_Seasonal_Moving_Average_3,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
365,2767.833008,2778.667969,2752.966064,2759.362061,2759.362061,92000.0,92.01405,3.25,15.790039,2812.45695,...,2801.285075,-48.31901,2823.447998,-64.085938,2823.447998,-64.085938,8.024214,7.945714,7.937003,7.978969
386,2697.574951,2712.887939,2688.529053,2703.782959,2703.782959,59000.0,92.27545,3.25,-55.579102,2734.154948,...,2722.663411,-34.134359,2735.572347,-31.789388,2735.572347,-31.789388,7.945714,7.937003,7.978969,7.995445
409,2569.799072,2584.803955,2547.854004,2556.041992,2556.041992,57200.0,92.71113,3.25,-147.740967,2678.402344,...,2663.116374,-115.26237,2673.062337,-117.020345,2673.062337,-117.020345,7.937003,7.978969,7.995445,7.917016
447,2450.331055,2491.35498,2445.5271,2470.019043,2470.019043,88800.0,92.62399,3.25,-86.022949,2572.568359,...,2560.636719,-115.109619,2576.614665,-106.595622,2576.614665,-106.595622,7.978969,7.995445,7.917016,7.922755
469,2392.485107,2423.559082,2376.916016,2386.860107,2386.860107,98600.0,92.88539,3.25,-83.158936,2470.871745,...,2456.765706,-79.849691,2470.973714,-84.113607,2470.973714,-84.113607,7.995445,7.917016,7.922755,7.902407
507,2288.065918,2305.863037,2263.342041,2268.080078,2268.080078,53600.0,94.19241,3.25,-118.780029,2376.960693,...,2361.928385,-98.586344,2374.98641,-106.906331,2374.98641,-106.906331,7.917016,7.922755,7.902407,7.846215
528,2418.788086,2437.874023,2418.333984,2426.11499,2426.11499,75000.0,94.36668,3.25,158.034912,2366.44637,...,2352.864014,65.469971,2360.351725,65.763265,2360.351725,65.763265,7.922755,7.902407,7.846215,7.811981
591,2373.224121,2388.085938,2365.441895,2373.436035,2373.436035,77000.0,93.49534,3.25,-52.678955,2360.026042,...,2349.039307,16.402588,2355.877035,17.559001,2355.877035,17.559001,7.902407,7.846215,7.811981,7.777734
634,2101.7229,2130.711914,2101.708008,2123.360107,2123.360107,53800.0,94.193,3.25,-250.075928,2297.911702,...,2295.161296,-193.453288,2307.637044,-184.276937,2307.637044,-184.276937,7.846215,7.811981,7.777734,7.726689
697,2070.02002,2109.496094,2069.947998,2104.427002,2104.427002,82800.0,94.54095,3.25,-18.933105,2181.65568,...,2179.032633,-109.084635,2200.407715,-95.980713,2200.407715,-95.980713,7.811981,7.777734,7.726689,7.794046


In [None]:
# Determine the length of the training data (70%)
train_len = int(len(data["Adj Close"]) * 0.7)

# Determine the length of the validation data (10%)
val_len = int(len(data["Adj Close"]) * 0.1)

# Set the training, validation, and test data
train_data = data.iloc[:train_len]
val_data = data.iloc[train_len:train_len + val_len]
test_data = data.iloc[train_len + val_len:]


In [None]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / (y_true + 1e-10))) * 100
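
A brief sketch of how this metric behaves when targets are close to zero, which may help when reading very large MAPE values on min-max scaled targets. The function below mirrors the definition above under the hypothetical name `mape`; the input values are synthetic.

```python
import numpy as np


def mape(y_true, y_pred, eps=1e-10):
    """Mean absolute percentage error with a small epsilon in the denominator."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100


# Price-like targets: each prediction is off by 1%, so MAPE is about 1%.
price_mape = mape([3000.0, 3100.0], [3030.0, 3069.0])

# Scaled targets containing an exact zero: the tiny denominator dominates the mean.
scaled_mape = mape([0.0, 0.5], [0.01, 0.5])

print(price_mape, scaled_mape)
```

A single target at exactly 0 is enough to push the average into the billions of percent, so MAPE is best read on targets that stay well away from zero.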


In [None]:
data.columns

Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'China CPI',
       'China Interest Rate', 'diff Adj Close',
       'Open_Seasonal_Moving_Average_3',
       'Open_Stationary_Seasonal_Moving_Average_3',
       'High_Seasonal_Moving_Average_3',
       'High_Stationary_Seasonal_Moving_Average_3',
       'Low_Seasonal_Moving_Average_3',
       'Low_Stationary_Seasonal_Moving_Average_3',
       'Close_Seasonal_Moving_Average_3',
       'Close_Stationary_Seasonal_Moving_Average_3',
       'Adj Close_Seasonal_Moving_Average_3',
       'Adj Close_Stationary_Seasonal_Moving_Average_3', 't-6', 't-5', 't-4',
       't-3'],
      dtype='object')

# ***1) Removing outlier data from the training dataframe***

In [None]:
# Selecting columns
columns = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'China CPI',
       'China Interest Rate', 'diff Adj Close',
       'Open_Seasonal_Moving_Average_3',
       'Open_Stationary_Seasonal_Moving_Average_3',
       'High_Seasonal_Moving_Average_3',
       'High_Stationary_Seasonal_Moving_Average_3',
       'Low_Seasonal_Moving_Average_3',
       'Low_Stationary_Seasonal_Moving_Average_3',
       'Close_Seasonal_Moving_Average_3',
       'Close_Stationary_Seasonal_Moving_Average_3',
       'Adj Close_Seasonal_Moving_Average_3',
       'Adj Close_Stationary_Seasonal_Moving_Average_3', 't-6', 't-5', 't-4',
       't-3']

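# Calculating Z-Score for each column
z_scores = zscore(train_data[columns])

# Creating a training dataframe without outliers.
# Note: `z_scores < 3` is one-sided, so only high-side outliers are removed;
# `np.abs(z_scores) < 3` would filter both tails.
train_data_without_outliers = train_data[(z_scores < 3).all(axis=1)]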
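
The difference between a one-sided and a two-sided Z-Score filter can be seen on a small synthetic column. This is a sketch under assumed data (20 zeros plus one high and one low extreme value), not the notebook's training set.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic column: 20 ordinary values, one high outlier, one low outlier.
df = pd.DataFrame({"x": [0.0] * 20 + [10.0, -10.0]})

# Column-wise z-scores as a plain NumPy array.
z = zscore(df.to_numpy(), axis=0)

# One-sided filter: keeps every row whose z-score is below 3, including the low outlier.
one_sided = df[(z < 3).all(axis=1)]

# Two-sided filter: drops rows whose z-score magnitude is 3 or more.
two_sided = df[(np.abs(z) < 3).all(axis=1)]

print(len(one_sided), len(two_sided))
```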


# ***2) Scaling the training data with min-max scaler***

In [None]:
scaler = MinMaxScaler()

# Fit the scaler on the outlier-free training data and transform it
scaled_columns = scaler.fit_transform(train_data_without_outliers[columns])

# Create a new dataframe with scaled columns
train_data_scaled = pd.DataFrame(scaled_columns, columns=columns)

train_data_scaled


Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,China CPI,China Interest Rate,diff Adj Close,Open_Seasonal_Moving_Average_3,...,Low_Seasonal_Moving_Average_3,Low_Stationary_Seasonal_Moving_Average_3,Close_Seasonal_Moving_Average_3,Close_Stationary_Seasonal_Moving_Average_3,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
0,0.356667,0.337228,0.381339,0.37121,0.37121,0.070433,0.0,1.0,0.44771,0.35918,...,0.366925,0.467305,0.352281,0.448638,0.352281,0.448638,0.600529,0.63191,0.523205,0.430459
1,0.325416,0.308888,0.350446,0.34421,0.34421,0.009905,0.023369,1.0,0.41232,0.322979,...,0.329566,0.47557,0.311841,0.467081,0.311841,0.467081,0.489789,0.616055,0.588073,0.449101
2,0.26858,0.253707,0.283001,0.272437,0.272437,0.006603,0.062317,1.0,0.366619,0.297203,...,0.301271,0.428302,0.283073,0.41841,0.283073,0.41841,0.4775,0.692434,0.61354,0.360362
3,0.21544,0.213447,0.233942,0.230647,0.230647,0.064563,0.054527,1.0,0.397224,0.248273,...,0.252576,0.428391,0.238688,0.424363,0.238688,0.424363,0.536701,0.722421,0.49231,0.366856
4,0.18971,0.184239,0.201048,0.190248,0.190248,0.082539,0.077896,1.0,0.398644,0.201256,...,0.20322,0.448934,0.190072,0.437201,0.190072,0.437201,0.559943,0.579677,0.50118,0.343833
5,0.143263,0.133532,0.146597,0.132545,0.132545,0.0,0.194741,1.0,0.38098,0.157838,...,0.158156,0.438018,0.145898,0.424186,0.145898,0.424186,0.449304,0.590122,0.469728,0.280255
6,0.201409,0.190406,0.220905,0.209318,0.209318,0.039252,0.21032,1.0,0.518246,0.152977,...,0.153849,0.533603,0.139163,0.522788,0.139163,0.522788,0.457399,0.553088,0.382871,0.24152
7,0.181142,0.168956,0.195547,0.183727,0.183727,0.04292,0.132424,1.0,0.413758,0.150009,...,0.152032,0.505014,0.137104,0.495261,0.137104,0.495261,0.428695,0.450816,0.329954,0.202771
8,0.060376,0.058073,0.069104,0.06224,0.06224,0.000367,0.194793,1.0,0.315874,0.121292,...,0.126431,0.382745,0.114904,0.380004,0.114904,0.380004,0.349424,0.388509,0.277017,0.145016
9,0.046275,0.048933,0.053877,0.053042,0.053042,0.053558,0.225899,1.0,0.430492,0.067543,...,0.07125,0.431901,0.065556,0.430425,0.065556,0.430425,0.30113,0.326177,0.198114,0.221228


# ***3) Validation data scaling with min-max scaler***

In [None]:
val_data_scaled = val_data.copy()

scaled_columns_val = scaler.transform(val_data_scaled[columns])

val_data_scaled = pd.DataFrame(scaled_columns_val, columns=columns, index=val_data.index)

val_data_scaled


Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,China CPI,China Interest Rate,diff Adj Close,Open_Seasonal_Moving_Average_3,...,Low_Seasonal_Moving_Average_3,Low_Stationary_Seasonal_Moving_Average_3,Close_Seasonal_Moving_Average_3,Close_Stationary_Seasonal_Moving_Average_3,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
1712,1.083228,1.083872,1.139726,1.091076,1.091076,0.468464,1.014755,0.23913,0.779063,0.565572,...,0.591799,0.818264,0.559161,0.794979,0.559161,0.794979,0.579827,0.689057,0.700868,0.606217
1772,1.069003,1.067246,1.123592,1.06856,1.06856,0.348678,1.059021,0.23913,0.646024,0.592039,...,0.622909,0.753408,0.590416,0.723199,0.590416,0.723199,0.534083,0.700868,0.713799,0.662702
1835,0.958542,0.943592,1.000559,0.945363,0.945363,0.27823,1.051644,0.23913,0.557504,0.593184,...,0.623022,0.634711,0.587044,0.610342,0.587044,0.610342,0.543237,0.713799,0.780308,0.692255
1879,1.097779,1.094821,1.154863,1.107577,1.107577,0.466938,1.081154,0.23913,0.808442,0.596125,...,0.626088,0.778492,0.590334,0.760788,0.590334,0.760788,0.55326,0.780308,0.815105,0.765195
1902,1.174653,1.170206,1.229431,1.171161,1.171161,0.582146,1.12542,0.23913,0.721724,0.617472,...,0.647528,0.816339,0.61079,0.788996,0.61079,0.788996,0.604811,0.815105,0.90099,0.75269


# ***4) Scaling test data with min-max scaler***

In [None]:
test_data_scaled = test_data.copy()


scaled_columns_test = scaler.transform(test_data_scaled[columns])


test_data_scaled = pd.DataFrame(scaled_columns_test, columns=columns, index=test_data.index)

test_data_scaled


Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,China CPI,China Interest Rate,diff Adj Close,Open_Seasonal_Moving_Average_3,...,Low_Seasonal_Moving_Average_3,Low_Stationary_Seasonal_Moving_Average_3,Close_Seasonal_Moving_Average_3,Close_Stationary_Seasonal_Moving_Average_3,Adj Close_Seasonal_Moving_Average_3,Adj Close_Stationary_Seasonal_Moving_Average_3,t-6,t-5,t-4,t-3
1941,1.198126,1.194206,1.25587,1.195742,1.195742,0.322991,1.132798,0.23913,0.687433,0.665883,...,0.699246,0.759826,0.66071,0.732473,0.66071,0.732473,0.631782,0.90099,0.886265,0.682416
1963,1.131958,1.121735,1.180257,1.128902,1.128902,0.217701,1.154931,0.23913,0.607053,0.67279,...,0.704391,0.678835,0.664962,0.661482,0.664962,0.661482,0.698351,0.886265,0.80352,0.774296
2007,1.269194,1.265753,1.287306,1.239338,1.239338,0.526195,1.302484,0.23913,0.762918,0.691893,...,0.716114,0.763371,0.678555,0.745684,0.678555,0.745684,0.686938,0.80352,0.911705,0.808871
2022,1.064821,1.084276,1.115442,1.091457,1.091457,0.268311,1.213952,0.23913,0.535801,0.664957,...,0.687668,0.642905,0.657763,0.637089,0.657763,0.637089,0.622802,0.911705,0.952417,0.822034
2086,0.938685,0.933939,0.967304,0.921898,0.921898,0.194049,1.184441,0.23913,0.516742,0.625904,...,0.64453,0.568583,0.616491,0.540554,0.616491,0.540554,0.706656,0.952417,0.967915,0.785978
2129,0.768992,0.761122,0.760316,0.70796,0.70796,0.243896,1.258218,0.23913,0.477723,0.524832,...,0.537778,0.538408,0.510546,0.505214,0.510546,0.505214,0.738212,0.967915,0.925461,0.845105
2191,0.54625,0.541097,0.567,0.521602,0.521602,0.374619,1.295106,0.23913,0.501971,0.420048,...,0.42668,0.528291,0.39693,0.508664,0.39693,0.508664,0.750224,0.925461,0.995081,0.765406
2257,0.530095,0.525627,0.555506,0.531842,0.531842,0.19939,1.420526,0.23913,0.674824,0.337487,...,0.343262,0.649448,0.319161,0.643313,0.319161,0.643313,0.717318,0.995081,0.901237,0.668661
2272,0.829315,0.842685,0.854137,0.852636,0.852636,0.743133,1.391015,0.23913,0.947868,0.349676,...,0.362267,0.906995,0.348006,0.90498,0.348006,0.90498,0.77128,0.901237,0.787324,0.537264
2293,0.961262,0.996869,1.012831,1.00319,1.00319,1.049084,1.398393,0.23913,0.79819,0.433534,...,0.452579,0.916708,0.444024,0.895403,0.444024,0.895403,0.698543,0.787324,0.632609,0.412931



# ***Random Forest with Min-Max Scaler and Some Features***

In [None]:
# Fit a separate scaler for the target column on training data only.
# Note: 'Adj Close' was already min-max scaled in step 2, so on the training
# data this second scaler is effectively an identity transform.
scaler_adj_close = MinMaxScaler()
scaler_adj_close.fit(train_data_scaled[['Adj Close']])

# Transform all datasets using the fitted scaler
train_data_scaled['Adj Close'] = scaler_adj_close.transform(train_data_scaled[['Adj Close']])
val_data_scaled['Adj Close'] = scaler_adj_close.transform(val_data_scaled[['Adj Close']])
test_data_scaled['Adj Close'] = scaler_adj_close.transform(test_data_scaled[['Adj Close']])

# Select the desired features
features = [
    'Open_Seasonal_Moving_Average_3', 'High_Seasonal_Moving_Average_3',
    'Low_Seasonal_Moving_Average_3', 't-6', 't-5', 't-4', 't-3',
    'Volume', 'China CPI', 'China Interest Rate'
]

# Create the target variable: the next row's 'Adj Close'
train_data_scaled['Adj Close (t+1)'] = train_data_scaled['Adj Close'].shift(-1)
val_data_scaled['Adj Close (t+1)'] = val_data_scaled['Adj Close'].shift(-1)
test_data_scaled['Adj Close (t+1)'] = test_data_scaled['Adj Close'].shift(-1)

# Drop rows with NaN values caused by shifting
train_data_scaled = train_data_scaled.dropna()
val_data_scaled = val_data_scaled.dropna()
test_data_scaled = test_data_scaled.dropna()

# Define features and target variables
X_train = train_data_scaled[features]
y_train = train_data_scaled['Adj Close (t+1)']

X_val = val_data_scaled[features]
y_val = val_data_scaled['Adj Close (t+1)']

X_test = test_data_scaled[features]
y_test = test_data_scaled['Adj Close (t+1)']

# Define the model as a pipeline
pipe = Pipeline([
    ('rf', RandomForestRegressor(warm_start=True, random_state=42))
])

# Define the hyperparameters to tune
param_space = {'rf__n_estimators': (10, 800), 'rf__max_depth': (1, 200), 'rf__max_features': ['sqrt', 'log2', None]}

# Initialize TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=10)

# Initialize BayesSearchCV
opt = BayesSearchCV(
    pipe,
    param_space,
    cv=tscv,
    n_iter=100,  # Change this to the desired number of iterations
    random_state=0,
)

# Fit BayesSearchCV on the training data
opt.fit(X_train, y_train)

# Get the best model
best_model = opt.best_estimator_

# Create a list to store the predictions
predictions_val = []

# Loop over the validation data
for i in range(len(X_val)):
    # Make a prediction for the current day
    current_pred = best_model.predict(X_val.iloc[i:i+1])

    # Add the prediction to the list of predictions
    predictions_val.append(current_pred[0])

# Convert the validation predictions to the original scale
predictions_val_rescaled = scaler_adj_close.inverse_transform(np.array(predictions_val).reshape(-1, 1))

# Convert y_val to the original scale
y_val_rescaled = scaler_adj_close.inverse_transform(y_val.values.reshape(-1, 1))

# Calculate RMSE for the validation data
rmse_val = math.sqrt(mean_squared_error(y_val_rescaled, predictions_val_rescaled))
print(f'Validation RMSE: {rmse_val}')

# Add the validation data to the training data
train_data_scaled = pd.concat([train_data_scaled, val_data_scaled]).reset_index(drop=True)
y_train = train_data_scaled['Adj Close (t+1)']
X_train = train_data_scaled[features]

# Re-run the Bayesian search and refit on the combined train + validation data
opt.fit(X_train, y_train)

# Get the updated model
best_model = opt.best_estimator_

# Create a list to store the predictions
predictions_test = []

# Loop over the test data
for i in range(len(X_test)):
    # Make a prediction for the current day
    current_pred = best_model.predict(X_test.iloc[i:i+1])

    # Add the prediction to the list of predictions
    predictions_test.append(current_pred[0])

# Convert the test predictions to the original scale
predictions_test_rescaled = scaler_adj_close.inverse_transform(np.array(predictions_test).reshape(-1, 1))

# Convert y_test to the original scale
y_test_rescaled = scaler_adj_close.inverse_transform(y_test.values.reshape(-1, 1))

# Calculate RMSE for the test data
rmse_test = math.sqrt(mean_squared_error(y_test_rescaled, predictions_test_rescaled))

print(f'Test RMSE: {rmse_test}')


Validation RMSE: 0.22432668565231925
Test RMSE: 0.24625465323564838


In [2]:
# Calculate RMSE, MAE, MAPE and R^2 for the training data
# (X_train and y_train now include the validation rows added before the refit)
predictions_train = best_model.predict(X_train)
predictions_train_rescaled = scaler_adj_close.inverse_transform(np.array(predictions_train).reshape(-1, 1))
y_train_rescaled = scaler_adj_close.inverse_transform(y_train.values.reshape(-1, 1))
rmse_train = math.sqrt(mean_squared_error(y_train_rescaled, predictions_train_rescaled))
mae_train = mean_absolute_error(y_train_rescaled, predictions_train_rescaled)
mape_train = mean_absolute_percentage_error(y_train_rescaled, predictions_train_rescaled)
r2_train = r2_score(y_train_rescaled, predictions_train_rescaled)
print(f'Training RMSE: {rmse_train}')
print(f'Training MAE: {mae_train}')
print(f'Training MAPE: {mape_train}%')
print(f'Training R^2: {r2_train}')

# Calculate RMSE, MAE, MAPE and R^2 for the validation data
rmse_val = math.sqrt(mean_squared_error(y_val_rescaled, predictions_val_rescaled))
mae_val = mean_absolute_error(y_val_rescaled, predictions_val_rescaled)
mape_val = mean_absolute_percentage_error(y_val_rescaled, predictions_val_rescaled)
r2_val = r2_score(y_val_rescaled, predictions_val_rescaled)
print(f'Validation RMSE: {rmse_val}')
print(f'Validation MAE: {mae_val}')
print(f'Validation MAPE: {mape_val}%')
print(f'Validation R^2: {r2_val}')

# Calculate RMSE, MAE, MAPE and R^2 for the test data
rmse_test = math.sqrt(mean_squared_error(y_test_rescaled, predictions_test_rescaled))
mae_test = mean_absolute_error(y_test_rescaled, predictions_test_rescaled)
mape_test = mean_absolute_percentage_error(y_test_rescaled, predictions_test_rescaled)
r2_test = r2_score(y_test_rescaled, predictions_test_rescaled)
print(f'Test RMSE: {rmse_test}')
print(f'Test MAE: {mae_test}')
print(f'Test MAPE: {mape_test}%')
print(f'Test R^2: {r2_test}')


Training RMSE: 0.056779405722064996
Training MAE: 0.045432835302089664
Training MAPE: 339129655.50817347%
Training R^2: 0.9735782288990061
Validation RMSE: 0.22432668565231925
Validation MAE: 0.20683675997142228
Validation MAPE: 18.74237605131762%
Validation R^2: 0.416177188167471
Test RMSE: 0.24625465323564838
Test MAE: 0.18797928580097897
Test MAPE: 27.787845444697627%
Test R^2: 0.3326826540846419


In [3]:
print(f'Best parameters found: {opt.best_params_}')


Best parameters found: OrderedDict([('rf__max_depth', 11), ('rf__max_features', None), ('rf__n_estimators', 10)])
