## Introduction

This notebook briefly analyzes how the choice of input features affects the performance of LSTM models. The model is trained in two configurations: predicting the next day's closing price, and predicting the next day's return. The second configuration is more useful for building investment strategies than the first. The datasets range from closing prices only to OHLC (open, high, low, close) values with multiple lags as features.
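
To make the two targets concrete, the sketch below derives both from a short list of closing prices. The function name `make_targets` and the sample prices are illustrative assumptions, not part of the notebook, and plain Python lists are used so the snippet runs on its own.

```python
def make_targets(closes):
    """Build the two prediction targets from chronological closing prices.

    next_close[t]  = closes[t + 1]
    next_return[t] = closes[t + 1] / closes[t] - 1
    The last day has no next-day value, so both lists are one item shorter.
    """
    next_close = closes[1:]
    next_return = [nxt / cur - 1 for cur, nxt in zip(closes[:-1], closes[1:])]
    return next_close, next_return


# Hypothetical closing prices, for illustration only
closes = [10.0, 11.0, 9.9, 9.9]
next_close, next_return = make_targets(closes)

print(next_close)
print([round(r, 4) for r in next_return])
```

The price target keeps the scale of the stock, while the return target is a relative change that can be compared across periods with very different price levels.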

## Data Gathering

In [1]:
# Import used libraries
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import pandas_ta as ta
import warnings
warnings.filterwarnings('ignore')

sns.set()

In [2]:
# Import Petrobras data from Yahoo Finance through pandas_ta
df = pd.DataFrame()
df = df.ta.ticker("petr4.sa", start='1998-01-01', end='2022-12-29')
df.index = df.index.date
df.index.name = 'Date'
df = df[:-1]
df = df.iloc[:, :5]
# make index datetime
df.index = pd.to_datetime(df.index)
# sort by index
df = df.sort_index()

# get the days on which the stock was traded (i.e., the index of df)
days_total = (pd.DataFrame(df.index)
                .reset_index(drop=False)
                .set_index('Date')
                .drop(columns=['index']))

In [11]:
df['target_Close'] = df['Close']
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5776 entries, 2000-01-03 to 2022-12-27
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          5776 non-null   float64
 1   High          5776 non-null   float64
 2   Low           5776 non-null   float64
 3   Close         5776 non-null   float64
 4   Volume        5776 non-null   int64  
 5   target_Close  5776 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 315.9 KB


## Predicting Next-Day Close Value

Here, four identical LSTM models are trained on different datasets to study how the input features affect the model's performance.

- Close only: uses only the previous day's closing price.
- Close with lags: uses several previous closing prices.
- OHLC: uses open, high, low, and close (OHLC) data from the previous day.
- OHLC with lags: uses OHLC data from several previous days.

Note: the quality of the predictions didn't change much from dataset to dataset. Also, predicting the closing price is of limited interest, because what matters more is predicting the next day's return.
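
One way to put the remark about closing prices in context is a persistence baseline, which predicts each day's close with the previous day's close. The sketch below is an illustration added here, not a result from this notebook; the function name `persistence_rmse` and the prices are hypothetical.

```python
import math


def persistence_rmse(closes):
    """RMSE of predicting each day's close with the previous day's close."""
    errors = [today - yesterday for yesterday, today in zip(closes[:-1], closes[1:])]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical closing prices, for illustration only
closes = [20.0, 20.2, 20.1, 20.3, 20.2]

rmse = persistence_rmse(closes)
relative = rmse / (sum(closes) / len(closes))  # error relative to the mean price

print(f"persistence RMSE: {rmse:.4f}, relative to mean price: {relative:.4%}")
```

When daily moves are small relative to the price level, a forecast of this kind already looks accurate in price units while saying nothing about the direction of the next move, which is why the return is the more informative target.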

### Model with Close

In [12]:
# define macros for the columns
MOVE_FORWARD = [str(col) for col in df.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df.columns if col not in ['target_Close', 'Open', 'High', 'Low', 'Volume']]
TARGET = ['target_Close']
delay = 1

df_shift = df.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_shift.loc[:, MOVE_FORWARD] = df_shift.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_shift.columns]))
df_shift = df_shift.dropna()#.reset_index()

display(df_shift.head(1))

COL_TO_DROP = ['Open', 'High', 'Low', 'Volume']
df_feature_w_target = df_shift.drop(COL_TO_DROP, axis=1).copy()

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close |
|---|---|---|---|---|---|---|
| 2000-01-04 | 1.872028 | 1.872028 | 1.872028 | 1.872028 | 35389440000.0 | 1.768468 |


| Date | Close | target_Close |
|---|---|---|
| 2000-01-04 | 1.872028 | 1.768468 |
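
The alignment produced by `shift(delay)` followed by `dropna()` can be sketched without pandas. The helper `shift_features` is a hypothetical name; the first two closes are taken from the output above and the third is made up for illustration.

```python
def shift_features(rows, delay=1):
    """Pair each day's target with the close from `delay` days earlier.

    `rows` is a chronological list of (date, close) tuples. The first
    `delay` days are dropped because they have no earlier feature value,
    which mirrors shift(delay) followed by dropna().
    """
    return [
        (date, rows[i - delay][1], close)  # (date, previous close, target close)
        for i, (date, close) in enumerate(rows)
        if i >= delay
    ]


rows = [
    ("2000-01-03", 1.872028),
    ("2000-01-04", 1.768468),
    ("2000-01-05", 1.750000),  # hypothetical value, for illustration only
]

aligned = shift_features(rows)
for row in aligned:
    print(row)
```

Each output row carries only information that was available before the target day, which is the property the shift is meant to guarantee.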


In [13]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
valid_data = scaler.transform(valid_data)
test_data = scaler.transform(test_data)

# Split the data into features and targets
X_train1 = train_data[:, :-1]
y_train1 = train_data[:, -1]
X_valid1 = valid_data[:, :-1]
y_valid1 = valid_data[:, -1]
X_test1 = test_data[:, :-1]
y_test1 = test_data[:, -1]

# Reshape the data for LSTM
X_train1 = np.reshape(X_train1, (*(X_train1.shape), 1))
X_valid1 = np.reshape(X_valid1, (*(X_valid1.shape), 1))
X_test1 = np.reshape(X_test1, (*(X_test1.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train1.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp1 = ModelCheckpoint('model1/', save_best_only=True)
history = model.fit(X_train1, y_train1, epochs=5, batch_size=32, validation_data=(X_valid1, y_valid1))#, callbacks=[cp1])

# from tensorflow.keras.models import load_model
# model1 = load_model('model1/')
model1 = model

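# Evaluate the model on the test set
# evaluate() returns [loss, metric], i.e. [MSE, RMSE] for this compile() call
mse, rmse = model1.evaluate(X_test1, y_test1)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed 0.0897, 0.1060 under the "MSE" label; those were RMSE values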

# make predictions
y_pred1 = model1.predict(X_test1).flatten()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 1, 50)             10400     
                                                                 
 dropout (Dropout)           (None, 1, 50)             0         
                                                                 
 lstm_1 (LSTM)               (None, 1, 50)             20200     
                                                                 
 dropout_1 (Dropout)         (None, 1, 50)             0         
                                                                 
 lstm_2 (LSTM)               (None, 50)                20200     
                                                                 
 dense (Dense)               (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trainable params: 0
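
The parameter counts in the summary can be checked by hand. The sketch below uses the standard count for an LSTM layer (four gates, each with an input kernel, a recurrent kernel, and a bias); the helper names `lstm_params` and `dense_params` are introduced here for illustration.

```python
def lstm_params(input_dim, units):
    """Weights in one LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights in a dense layer: kernel plus bias."""
    return input_dim * units + units


layers = [
    lstm_params(1, 50),    # first LSTM: one value per timestep
    lstm_params(50, 50),   # second LSTM: fed by 50 units
    lstm_params(50, 50),   # third LSTM
    dense_params(50, 1),   # output layer
]
total = sum(layers)

print(layers, total)
```

The count does not depend on the number of timesteps, which is consistent with every model summary in this notebook reporting 50,851 parameters even though the sequence length changes.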

### Model with Close and Lags

In [14]:
# add lagged columns to the dataframe
def get_data_w_lags(df):
    df = df.copy()

    # add the Close column lagged by 1, 2, 3, 5, 10, 15 and 20 days
    for lag in [1, 2, 3, 5, 10, 15, 20]:
        col = 'Close'
        df[col + '_lag_f' + str(lag)] = df[col].shift(lag)

    return df

df_w_lags = get_data_w_lags(df)

# define macros for the columns
MOVE_FORWARD = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_w_lags_shift = df_w_lags.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_w_lags_shift.loc[:, MOVE_FORWARD] = df_w_lags.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_w_lags_shift.columns]))
df_w_lags_shift = df_w_lags_shift.dropna()#.reset_index()

display(df_w_lags_shift.head(1))

# move columns TARGET to the end
df_w_lags_shift = df_w_lags_shift[[col for col in df_w_lags_shift.columns if col not in TARGET] + TARGET]

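# Note: Open, High, Low and Volume are not dropped here (unlike in the
# close-only model), so this dataset has 12 feature columns
df_feature_w_target = df_w_lags_shift.copy()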

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close | Close_lag_f1 | Close_lag_f2 | Close_lag_f3 | Close_lag_f5 | Close_lag_f10 | Close_lag_f15 | Close_lag_f20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-02-01 | 1.625398 | 1.625398 | 1.625398 | 1.625398 | 32266240000.0 | 1.656944 | 1.65312 | 1.65312 | 1.664909 | 1.645153 | 1.700916 | 1.788542 | 1.872028 |


| Date | Open | High | Low | Close | Volume | Close_lag_f1 | Close_lag_f2 | Close_lag_f3 | Close_lag_f5 | Close_lag_f10 | Close_lag_f15 | Close_lag_f20 | target_Close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-02-01 | 1.625398 | 1.625398 | 1.625398 | 1.625398 | 32266240000.0 | 1.65312 | 1.65312 | 1.664909 | 1.645153 | 1.700916 | 1.788542 | 1.872028 | 1.656944 |


In [15]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
valid_data = scaler.transform(valid_data)
test_data = scaler.transform(test_data)

# Split the data into features and targets
X_train0 = train_data[:, :-1]
y_train0 = train_data[:, -1]
X_valid0 = valid_data[:, :-1]
y_valid0 = valid_data[:, -1]
X_test0 = test_data[:, :-1]
y_test0 = test_data[:, -1]

# Reshape the data for LSTM
X_train0 = np.reshape(X_train0, (*(X_train0.shape), 1))
X_valid0 = np.reshape(X_valid0, (*(X_valid0.shape), 1))
X_test0 = np.reshape(X_test0, (*(X_test0.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train0.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp0 = ModelCheckpoint('model0/', save_best_only=True)
history = model.fit(X_train0, y_train0, epochs=5, batch_size=32, validation_data=(X_valid0, y_valid0))#, callbacks=[cp0])

# from tensorflow.keras.models import load_model
# model0 = load_model('model0/')
model0 = model

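# Evaluate the model on the test set
# evaluate() returns [loss, metric], i.e. [MSE, RMSE] for this compile() call
mse, rmse = model0.evaluate(X_test0, y_test0)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed 0.1109, 0.1099 under the "MSE" label; those were RMSE values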

# make predictions
y_pred0 = model0.predict(X_test0).flatten()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_3 (LSTM)               (None, 12, 50)            10400     
                                                                 
 dropout_2 (Dropout)         (None, 12, 50)            0         
                                                                 
 lstm_4 (LSTM)               (None, 12, 50)            20200     
                                                                 
 dropout_3 (Dropout)         (None, 12, 50)            0         
                                                                 
 lstm_5 (LSTM)               (None, 50)                20200     
                                                                 
 dense_1 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trainable params: 0
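
The sequence length of 12 in this summary follows from how the input is reshaped: each of the 12 feature columns becomes one timestep carrying a single value. The sketch below mimics `np.reshape(X, (*X.shape, 1))` with nested lists; the helpers `add_feature_axis` and `shape` are hypothetical names and the data is dummy.

```python
def add_feature_axis(matrix):
    """Turn a (samples, columns) nested list into (samples, columns, 1)."""
    return [[[value] for value in row] for row in matrix]


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


base_columns = ["Open", "High", "Low", "Close", "Volume"]
lag_columns = [f"Close_lag_f{lag}" for lag in [1, 2, 3, 5, 10, 15, 20]]
n_columns = len(base_columns) + len(lag_columns)

X = [[0.0] * n_columns for _ in range(4)]  # 4 dummy samples
X3d = add_feature_axis(X)

print(shape(X), "->", shape(X3d))
```

With this layout the LSTM steps through the columns of a single row in order, so the column order determines the order in which the network sees the values.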

### Model with OHLC

In [16]:
# define macros for the columns
MOVE_FORWARD = [str(col) for col in df.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_shift = df.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_shift.loc[:, MOVE_FORWARD] = df_shift.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_shift.columns]))
df_shift = df_shift.dropna()#.reset_index()

display(df_shift.head(1))

df_feature_w_target = df_shift.copy()

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close |
|---|---|---|---|---|---|---|
| 2000-01-04 | 1.872028 | 1.872028 | 1.872028 | 1.872028 | 35389440000.0 | 1.768468 |


| Date | Open | High | Low | Close | Volume | target_Close |
|---|---|---|---|---|---|---|
| 2000-01-04 | 1.872028 | 1.872028 | 1.872028 | 1.872028 | 35389440000.0 | 1.768468 |


In [17]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
valid_data = scaler.transform(valid_data)
test_data = scaler.transform(test_data)

# Split the data into features and targets
X_train2 = train_data[:, :-1]
y_train2 = train_data[:, -1]
X_valid2 = valid_data[:, :-1]
y_valid2 = valid_data[:, -1]
X_test2 = test_data[:, :-1]
y_test2 = test_data[:, -1]

# Reshape the data for LSTM
X_train2 = np.reshape(X_train2, (*(X_train2.shape), 1))
X_valid2 = np.reshape(X_valid2, (*(X_valid2.shape), 1))
X_test2 = np.reshape(X_test2, (*(X_test2.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train2.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp2 = ModelCheckpoint('model2/', save_best_only=True)
history = model.fit(X_train2, y_train2, epochs=5, batch_size=32, validation_data=(X_valid2, y_valid2))#, callbacks=[cp2])

# from tensorflow.keras.models import load_model
# model2 = load_model('model2/')
model2 = model

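# Evaluate the model on the test set
# evaluate() returns [loss, metric], i.e. [MSE, RMSE] for this compile() call
mse, rmse = model2.evaluate(X_test2, y_test2)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed 0.0923, 0.0638 under the "MSE" label; those were RMSE values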

# make predictions
y_pred2 = model2.predict(X_test2).flatten()


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_6 (LSTM)               (None, 5, 50)             10400     
                                                                 
 dropout_4 (Dropout)         (None, 5, 50)             0         
                                                                 
 lstm_7 (LSTM)               (None, 5, 50)             20200     
                                                                 
 dropout_5 (Dropout)         (None, 5, 50)             0         
                                                                 
 lstm_8 (LSTM)               (None, 50)                20200     
                                                                 
 dense_2 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trainable params: 0
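
The train/validation/test split used in each training cell is purely chronological: the last `252*3` rows are held out for testing and the 252 rows before them for validation (252 is a common approximation of the number of trading days in a year). A minimal sketch, with `chronological_split` as a hypothetical helper and integers standing in for rows:

```python
def chronological_split(rows, test_size=252 * 3, valid_size=252):
    """Split chronologically ordered rows into train, validation and test parts."""
    train_val, test = rows[:-test_size], rows[-test_size:]
    train, valid = train_val[:-valid_size], train_val[-valid_size:]
    return train, valid, test


# 5776 rows in the original frame, minus the one row removed by dropna() after shift(1)
rows = list(range(5775))
train, valid, test = chronological_split(rows)

print(len(train), len(valid), len(test))
```

Because nothing is shuffled, every validation row is later than every training row, and every test row is later than every validation row.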

### Model OHLC with Lags

In [18]:
# add lagged columns to the dataframe
def get_data_w_lags(df):
    df = df.copy()

    # add the OHLC columns lagged by 1, 2, 3, 5, 10, 15 and 20 days
    for lag in [1, 2, 3, 5, 10, 15, 20]:
        for col in ['Open', 'High', 'Low', 'Close']:
            df[col + '_stock_price_lag_f' + str(lag)] = df[col].shift(lag)

    return df

df_w_lags = get_data_w_lags(df)

# define macros for the columns
MOVE_FORWARD = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_w_lags_shift = df_w_lags.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_w_lags_shift.loc[:, MOVE_FORWARD] = df_w_lags.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_w_lags_shift.columns]))
df_w_lags_shift = df_w_lags_shift.dropna()#.reset_index()

display(df_w_lags_shift.head(1))

# move the TARGET columns to the end of the dataframe
df_w_lags_shift = df_w_lags_shift[[col for col in df_w_lags_shift.columns if col not in TARGET] + TARGET]

df_feature_w_target = df_w_lags_shift.copy()

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close | Open_stock_price_lag_f1 | High_stock_price_lag_f1 | Low_stock_price_lag_f1 | Close_stock_price_lag_f1 | ... | Low_stock_price_lag_f10 | Close_stock_price_lag_f10 | Open_stock_price_lag_f15 | High_stock_price_lag_f15 | Low_stock_price_lag_f15 | Close_stock_price_lag_f15 | Open_stock_price_lag_f20 | High_stock_price_lag_f20 | Low_stock_price_lag_f20 | Close_stock_price_lag_f20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-02-01 | 1.625398 | 1.625398 | 1.625398 | 1.625398 | 32266240000.0 | 1.656944 | 1.65312 | 1.65312 | 1.65312 | 1.65312 | ... | 1.700916 | 1.700916 | 1.788542 | 1.788542 | 1.788542 | 1.788542 | 1.872028 | 1.872028 | 1.872028 | 1.872028 |


| Date | Open | High | Low | Close | Volume | Open_stock_price_lag_f1 | High_stock_price_lag_f1 | Low_stock_price_lag_f1 | Close_stock_price_lag_f1 | Open_stock_price_lag_f2 | ... | Close_stock_price_lag_f10 | Open_stock_price_lag_f15 | High_stock_price_lag_f15 | Low_stock_price_lag_f15 | Close_stock_price_lag_f15 | Open_stock_price_lag_f20 | High_stock_price_lag_f20 | Low_stock_price_lag_f20 | Close_stock_price_lag_f20 | target_Close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-02-01 | 1.625398 | 1.625398 | 1.625398 | 1.625398 | 32266240000.0 | 1.65312 | 1.65312 | 1.65312 | 1.65312 | 1.65312 | ... | 1.700916 | 1.788542 | 1.788542 | 1.788542 | 1.788542 | 1.872028 | 1.872028 | 1.872028 | 1.872028 | 1.656944 |


In [19]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_data)
valid_data = scaler.transform(valid_data)
test_data = scaler.transform(test_data)

# Split the data into features and targets
X_train3 = train_data[:, :-1]
y_train3 = train_data[:, -1]
X_valid3 = valid_data[:, :-1]
y_valid3 = valid_data[:, -1]
X_test3 = test_data[:, :-1]
y_test3 = test_data[:, -1]

# Reshape the data for LSTM
X_train3 = np.reshape(X_train3, (*(X_train3.shape), 1))
X_valid3 = np.reshape(X_valid3, (*(X_valid3.shape), 1))
X_test3 = np.reshape(X_test3, (*(X_test3.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train3.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp3 = ModelCheckpoint('model3/', save_best_only=True)
history = model.fit(X_train3, y_train3, epochs=5, batch_size=32, validation_data=(X_valid3, y_valid3))#, callbacks=[cp3])

# from tensorflow.keras.models import load_model
# model3 = load_model('model3/')
model3 = model

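# Evaluate the model on the test set
# evaluate() returns [loss, metric], i.e. [MSE, RMSE] for this compile() call
mse, rmse = model3.evaluate(X_test3, y_test3)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed 0.1727, 0.1096 under the "MSE" label; those were RMSE values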

# make predictions
y_pred3 = model3.predict(X_test3).flatten()


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_9 (LSTM)               (None, 33, 50)            10400     
                                                                 
 dropout_6 (Dropout)         (None, 33, 50)            0         
                                                                 
 lstm_10 (LSTM)              (None, 33, 50)            20200     
                                                                 
 dropout_7 (Dropout)         (None, 33, 50)            0         
                                                                 
 lstm_11 (LSTM)              (None, 50)                20200     
                                                                 
 dense_3 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trainable params: 0
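
All four models scale their data with a `MinMaxScaler` fitted on the training rows only. The sketch below reproduces that arithmetic for a single column in plain Python; `fit_min_max` and `transform` are hypothetical helpers and the numbers are made up.

```python
def fit_min_max(train_column):
    """Learn the scaling range from the training data only."""
    return min(train_column), max(train_column)


def transform(column, lo, hi):
    """Map values so that the training minimum becomes 0 and the maximum 1."""
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical values, for illustration only
train = [2.0, 4.0, 6.0]
test = [5.0, 8.0]

lo, hi = fit_min_max(train)
train_scaled = transform(train, lo, hi)
test_scaled = transform(test, lo, hi)

print(train_scaled, test_scaled)
```

Test values above the training maximum map to values greater than 1, which matters for a price series whose test period may lie outside the range seen in training. The errors reported by the models are likewise in these scaled units, not in price units.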

## Predicting Next-Day Return

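Here, the same LSTM architecture is trained on the same kinds of datasets, but every column is first converted to its daily percentage change, so the target becomes the next day's return instead of the next day's closing price.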

Note: the quality of the predictions didn't change much from dataset to dataset.

In [20]:
df_pct_change = pd.DataFrame()
for col in df.columns:
    df_pct_change[col] = df[col].pct_change()
df_pct_change

| Date | Open | High | Low | Close | Volume | target_Close |
|---|---|---|---|---|---|---|
| 2000-01-03 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2000-01-04 | -0.055319 | -0.055319 | -0.055319 | -0.055319 | -0.184462 | -0.055319 |
| 2000-01-05 | -0.010090 | -0.010090 | -0.010090 | -0.010090 | 0.491041 | -0.010090 |
| 2000-01-06 | -0.003458 | -0.003458 | -0.003458 | -0.003458 | -0.208626 | -0.003458 |
| 2000-01-07 | 0.004566 | 0.004566 | 0.004566 | 0.004566 | -0.385928 | 0.004566 |
| ... | ... | ... | ... | ... | ... | ... |
| 2022-12-21 | 0.049576 | 0.020654 | 0.028816 | 0.021673 | 0.033075 | 0.021673 |
| 2022-12-22 | 0.017021 | 0.035413 | 0.037199 | 0.017819 | 0.102919 | 0.017819 |
| 2022-12-23 | 0.009623 | 0.026873 | 0.018143 | 0.047103 | -0.164930 | 0.047103 |
| 2022-12-26 | 0.041028 | 0.003172 | 0.028595 | -0.007166 | -0.579382 | -0.007166 |
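
The values above follow the usual definition of a simple return, `current / previous - 1`. The sketch below re-derives the 2000-01-04 return from the two closing prices shown earlier in this notebook; the plain-Python `pct_change` helper is a stand-in written for illustration and uses `None` where pandas would show `NaN`.

```python
def pct_change(values):
    """Simple return of each value relative to the previous one."""
    out = [None]  # the first observation has no previous value
    for prev, cur in zip(values[:-1], values[1:]):
        out.append(cur / prev - 1)
    return out


# Closing prices for 2000-01-03 and 2000-01-04, taken from the earlier outputs
closes = [1.872028, 1.768468]
returns = pct_change(closes)

print(returns[0], round(returns[1], 6))
```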


### Close Return

In [21]:
# define macros for the columns
MOVE_FORWARD = [str(col) for col in df_pct_change.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_pct_change.columns if col not in ['target_Close', 'Open', 'High', 'Low', 'Volume']]
TARGET = ['target_Close']
delay = 1

df_shift = df_pct_change.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_shift.loc[:, MOVE_FORWARD] = df_shift.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_shift.columns]))
df_shift = df_shift.dropna()#.reset_index()

display(df_shift.head(1))

COL_TO_DROP = ['Open', 'High', 'Low', 'Volume']
df_feature_w_target = df_shift.drop(COL_TO_DROP, axis=1).copy()

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close |
|---|---|---|---|---|---|---|
| 2000-01-05 | -0.055319 | -0.055319 | -0.055319 | -0.055319 | -0.184462 | -0.01009 |


| Date | Close | target_Close |
|---|---|---|
| 2000-01-05 | -0.055319 | -0.01009 |


In [22]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler1 = MinMaxScaler()
train_data = scaler1.fit_transform(train_data)
valid_data = scaler1.transform(valid_data)
test_data = scaler1.transform(test_data)

# Split the data into features and targets
X_train1 = train_data[:, :-1]
y_train1 = train_data[:, -1]
X_valid1 = valid_data[:, :-1]
y_valid1 = valid_data[:, -1]
X_test1 = test_data[:, :-1]
y_test1 = test_data[:, -1]

# Reshape the data for LSTM
X_train1 = np.reshape(X_train1, (*(X_train1.shape), 1))
X_valid1 = np.reshape(X_valid1, (*(X_valid1.shape), 1))
X_test1 = np.reshape(X_test1, (*(X_test1.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train1.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp1 = ModelCheckpoint('model1/', save_best_only=True)
history = model.fit(X_train1, y_train1, epochs=5, batch_size=32, validation_data=(X_valid1, y_valid1))#, callbacks=[cp1])

# from tensorflow.keras.models import load_model
# model1 = load_model('model1/')
model1 = model

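# Evaluate the model on the test set
# evaluate() returns [loss, metric], i.e. [MSE, RMSE] for this compile() call
mse, rmse = model1.evaluate(X_test1, y_test1)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed 0.1043, 0.1032, 0.1027, 0.1028, 0.1029 under the "MSE" label; those were RMSE values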

# make predictions
y_pred1 = model1.predict(X_test1).flatten()


Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_12 (LSTM)              (None, 1, 50)             10400     
                                                                 
 dropout_8 (Dropout)         (None, 1, 50)             0         
                                                                 
 lstm_13 (LSTM)              (None, 1, 50)             20200     
                                                                 
 dropout_9 (Dropout)         (None, 1, 50)             0         
                                                                 
 lstm_14 (LSTM)              (None, 50)                20200     
                                                                 
 dense_4 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trainable params: 0
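
The predictions in `y_pred1` are in the scaler's units, not in returns. A minimal sketch of mapping them back, assuming the minimum and maximum of the target column in the training data are known (with scikit-learn these are available from the fitted scaler's `data_min_` and `data_max_` attributes, where the target is the last column). The numeric values and the helper name `inverse_min_max` are hypothetical.

```python
def inverse_min_max(scaled_values, lo, hi):
    """Undo min-max scaling: scaled * (max - min) + min."""
    return [s * (hi - lo) + lo for s in scaled_values]


# Hypothetical training-set minimum and maximum of the target return
target_min, target_max = -0.10, 0.15

# Hypothetical scaled predictions
y_pred_scaled = [0.4, 0.5]
y_pred_returns = inverse_min_max(y_pred_scaled, target_min, target_max)

print([round(r, 4) for r in y_pred_returns])
```

Only after this step does the sign of a prediction indicate whether the model expects the price to rise or fall, since a scaled value of 0 corresponds to the most negative training return rather than to a zero return.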

### Close Return with Lags

In [23]:
# add lagged columns to the dataframe
def get_data_w_lags(df):
    df = df.copy()

    # add the Close column lagged by 1, 2, 3, 5, 10, 15 and 20 days
    for lag in [1, 2, 3, 5, 10, 15, 20]:
        col = 'Close'
        df[col + '_lag_f' + str(lag)] = df[col].shift(lag)

    return df

df_w_lags = get_data_w_lags(df_pct_change)

# define macros for the columns
MOVE_FORWARD = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_w_lags_shift = df_w_lags.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_w_lags_shift.loc[:, MOVE_FORWARD] = df_w_lags.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_w_lags_shift.columns]))
df_w_lags_shift = df_w_lags_shift.dropna()#.reset_index()

display(df_w_lags_shift.head(1))

# move columns TARGET to the end
df_w_lags_shift = df_w_lags_shift[[col for col in df_w_lags_shift.columns if col not in TARGET] + TARGET]

COL_TO_DROP = ['Open', 'High', 'Low', 'Volume']
df_w_lags_shift = df_w_lags_shift.drop(COL_TO_DROP, axis=1).copy()
df_feature_w_target = df_w_lags_shift.copy()

display(df_feature_w_target.head(1))

| Date | Open | High | Low | Close | Volume | target_Close | Close_lag_f1 | Close_lag_f2 | Close_lag_f3 | Close_lag_f5 | Close_lag_f10 | Close_lag_f15 | Close_lag_f20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2000-02-02 | 0.019408 | 0.019408 | 0.019408 | 0.019408 | -0.266344 | 0.019231 | -0.01677 | 0.0 | -0.007081 | 0.0 | 0.004684 | -0.024586 | -0.055319 |


Date,Close,Close_lag_f1,Close_lag_f2,Close_lag_f3,Close_lag_f5,Close_lag_f10,Close_lag_f15,Close_lag_f20,target_Close
2000-02-02,0.019408,-0.01677,0.0,-0.007081,0.0,0.004684,-0.024586,-0.055319,0.019231
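The cell above first builds the lag columns and then shifts every feature by `delay`. A minimal sketch on a toy return series (the frame `toy` and its values are made up for illustration, and only lag 1 is built) shows what each row ends up holding: after the extra shift, `Close` is the previous day's return and `Close_lag_f1` is the return from two days before the target.

```python
import pandas as pd

# Toy daily returns; the numbers are illustrative, not Petrobras data.
toy = pd.DataFrame({'Close': [0.01, 0.02, 0.03, 0.04, 0.05]})
toy['target_Close'] = toy['Close']

# Lag feature, built the same way as in get_data_w_lags (only lag 1 here).
toy['Close_lag_f1'] = toy['Close'].shift(1)

# Shift every feature by one more row so that row t only holds
# information that was available at the close of day t-1.
features = ['Close', 'Close_lag_f1']
toy[features] = toy[features].shift(1)
toy = toy.dropna()

print(toy)
```

The first surviving row pairs the target of day 2 with the return of day 1 (`Close`) and the return of day 0 (`Close_lag_f1`), so no same-day information leaks into the features.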


In [24]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler0 = MinMaxScaler()
train_data = scaler0.fit_transform(train_data)
valid_data = scaler0.transform(valid_data)
test_data = scaler0.transform(test_data)

# Split the data into features and targets
X_train0 = train_data[:, :-1]
y_train0 = train_data[:, -1]
X_valid0 = valid_data[:, :-1]
y_valid0 = valid_data[:, -1]
X_test0 = test_data[:, :-1]
y_test0 = test_data[:, -1]

# Reshape the data for LSTM
X_train0 = np.reshape(X_train0, (*(X_train0.shape), 1))
X_valid0 = np.reshape(X_valid0, (*(X_valid0.shape), 1))
X_test0 = np.reshape(X_test0, (*(X_test0.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train0.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp0 = ModelCheckpoint('model1/', save_best_only=True)
history = model.fit(X_train0, y_train0, epochs=5, batch_size=32, validation_data=(X_valid0, y_valid0))#, callbacks=[cp0])

# from tensorflow.keras.models import load_model
# model0 = load_model('model0/')
model0 = model

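# Evaluate the model on the test set
# evaluate() returns [loss, metric]: the MSE loss first, then the RMSE metric
mse, rmse = model0.evaluate(X_test0, y_test0)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed the RMSE under the "MSE" label: 0.1023, 0.1024, 0.1041, 0.1027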

# make predictions
y_pred0 = model0.predict(X_test0).flatten()


Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_15 (LSTM)              (None, 8, 50)             10400     
                                                                 
 dropout_10 (Dropout)        (None, 8, 50)             0         
                                                                 
 lstm_16 (LSTM)              (None, 8, 50)             20200     
                                                                 
 dropout_11 (Dropout)        (None, 8, 50)             0         
                                                                 
 lstm_17 (LSTM)              (None, 50)                20200     
                                                                 
 dense_5 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trai
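The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, which gives 4 * (units * (input_dim + units) + units) weights. The sketch below reproduces the figures shown above; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Weights in a standard LSTM layer: 4 gates x (kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights in a fully connected layer: kernel + bias."""
    return input_dim * units + units


first = lstm_params(input_dim=1, units=50)    # each timestep carries one value
second = lstm_params(input_dim=50, units=50)  # fed by the 50 outputs of the first layer
third = lstm_params(input_dim=50, units=50)
head = dense_params(input_dim=50, units=1)

total = first + second + third + head
print(first, second, third, head, total)
```

Because the reshape feeds each feature column as a timestep with a single value, the input dimension of the first layer is always 1, so the total does not change with the number of feature columns.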

### Close Return with OHLC

In [25]:
# define the column groups
MOVE_FORWARD = [str(col) for col in df_pct_change.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_pct_change.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_shift = df_pct_change.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_shift.loc[:, MOVE_FORWARD] = df_shift.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_shift.columns]))
df_shift = df_shift.dropna()#.reset_index()

display(df_shift.head(1))

drop_cols = ['Volume']
df_shift = df_shift.drop(drop_cols, axis=1)

df_feature_w_target = df_shift.copy()

display(df_feature_w_target.head(1))

Date,Open,High,Low,Close,Volume,target_Close
2000-01-05,-0.055319,-0.055319,-0.055319,-0.055319,-0.184462,-0.01009


Date,Open,High,Low,Close,target_Close
2000-01-05,-0.055319,-0.055319,-0.055319,-0.055319,-0.01009
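The training cells in this section hold out the last `252*3` rows for testing and the 252 rows before them for validation, presumably taking 252 as the approximate number of trading days in a year. As a rough sketch, the slicing can be wrapped in a helper so that the three blocks never overlap and stay in time order. The function `chronological_split` is hypothetical and the 5000-row input is an arbitrary stand-in.

```python
def chronological_split(rows, n_valid=252, n_test=252 * 3):
    """Split an ordered sequence into train / validation / test without shuffling."""
    if len(rows) <= n_valid + n_test:
        raise ValueError("not enough rows for the requested split")
    train_val, test = rows[:-n_test], rows[-n_test:]
    train, valid = train_val[:-n_valid], train_val[-n_valid:]
    return train, valid, test


rows = list(range(5000))  # stand-in for a time-ordered dataframe
train, valid, test = chronological_split(rows)
print(len(train), len(valid), len(test))
```

Keeping the split chronological matters here because the scaler is fitted on the training block only, so validation and test rows stay unseen until they are transformed.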


In [26]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler2 = MinMaxScaler()
train_data = scaler2.fit_transform(train_data)
valid_data = scaler2.transform(valid_data)
test_data = scaler2.transform(test_data)

# Split the data into features and targets
X_train2 = train_data[:, :-1]
y_train2 = train_data[:, -1]
X_valid2 = valid_data[:, :-1]
y_valid2 = valid_data[:, -1]
X_test2 = test_data[:, :-1]
y_test2 = test_data[:, -1]

# Reshape the data for LSTM
X_train2 = np.reshape(X_train2, (*(X_train2.shape), 1))
X_valid2 = np.reshape(X_valid2, (*(X_valid2.shape), 1))
X_test2 = np.reshape(X_test2, (*(X_test2.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train2.shape[1], 1)))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp2 = ModelCheckpoint('model1/', save_best_only=True)
history = model.fit(X_train2, y_train2, epochs=5, batch_size=22, validation_data=(X_valid2, y_valid2))#, callbacks=[cp2])

# from tensorflow.keras.models import load_model
# model2 = load_model('model2/')
model2 = model

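# Evaluate the model on the test set
# evaluate() returns [loss, metric]: the MSE loss first, then the RMSE metric
mse, rmse = model2.evaluate(X_test2, y_test2)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed the RMSE under the "MSE" label: 0.1025, 0.1022, 0.1027, 0.1021, 0.1063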

# make predictions
y_pred2 = model2.predict(X_test2).flatten()


Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_18 (LSTM)              (None, 4, 50)             10400     
                                                                 
 dropout_12 (Dropout)        (None, 4, 50)             0         
                                                                 
 lstm_19 (LSTM)              (None, 4, 50)             20200     
                                                                 
 dropout_13 (Dropout)        (None, 4, 50)             0         
                                                                 
 lstm_20 (LSTM)              (None, 50)                20200     
                                                                 
 dense_6 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trai
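Keras `evaluate` reports the loss followed by the compiled metrics, so with the compile call above it yields the MSE and then the RMSE, both measured on min-max-scaled targets. Since min-max scaling is an affine map, an RMSE in scaled units can be converted back to return units by multiplying by the target's range. The sketch below illustrates this with made-up numbers; `t_min`, `t_max` and the arrays are assumptions, not values from the notebook.

```python
import numpy as np

# Illustrative targets and predictions in original (return) units.
y_true = np.array([0.010, -0.020, 0.005, 0.030])
y_pred = np.array([0.000, -0.010, 0.010, 0.010])

# Assumed minimum and maximum of the target column seen when fitting the scaler.
t_min, t_max = -0.20, 0.20
scale = t_max - t_min


def to_scaled(y):
    """Min-max scale to [0, 1] using the assumed training range."""
    return (y - t_min) / scale


mse_scaled = np.mean((to_scaled(y_true) - to_scaled(y_pred)) ** 2)
rmse_scaled = np.sqrt(mse_scaled)

# Errors shrink by the same factor as the data, so multiply back by the range.
rmse_returns = rmse_scaled * scale
print(rmse_scaled, rmse_returns)
```

Read this way, a scaled RMSE is only meaningful relative to the range of returns seen in training, which is why comparing it across differently scaled datasets needs care.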

### Close Return with OHLC with Lags

In [27]:
# add lagged OHLC columns to the dataframe
def get_data_w_lags(df):
    df = df.copy()

    # lagged copies of the OHLC columns, shifted by 1, 2, 3, 5, 10, 15 and 20 rows
    for lag in [1, 2, 3, 5, 10, 15, 20]:
        for col in ['Open', 'High', 'Low', 'Close']:
            df[col + '_stock_price_lag_f' + str(lag)] = df[col].shift(lag)

    return df

df_w_lags = get_data_w_lags(df_pct_change)

# define the column groups
MOVE_FORWARD = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
FEATURES = [str(col) for col in df_w_lags.columns if col not in ['target_Close']]
TARGET = ['target_Close']
delay = 1

df_w_lags_shift = df_w_lags.copy()
# shift the features one day forward so that each row has the features of the previous day and the target of the current day
df_w_lags_shift.loc[:, MOVE_FORWARD] = df_w_lags.loc[:, MOVE_FORWARD].shift(delay)
# with open('columns_droped.txt', 'a') as f:
#     f.write(str([str(col) for col in df_w_lags_shift.columns]))
df_w_lags_shift = df_w_lags_shift.dropna()#.reset_index()

display(df_w_lags_shift.head(1))

# move the TARGET columns to the end of the dataframe
df_w_lags_shift = df_w_lags_shift[[col for col in df_w_lags_shift.columns if col not in TARGET] + TARGET]

COL_TO_DROP = ['Volume']
df_w_lags_shift = df_w_lags_shift.drop(COL_TO_DROP, axis=1)

df_feature_w_target = df_w_lags_shift.copy()

display(df_feature_w_target.head(1))

Date,Open,High,Low,Close,Volume,target_Close,Open_stock_price_lag_f1,High_stock_price_lag_f1,Low_stock_price_lag_f1,Close_stock_price_lag_f1,...,Low_stock_price_lag_f10,Close_stock_price_lag_f10,Open_stock_price_lag_f15,High_stock_price_lag_f15,Low_stock_price_lag_f15,Close_stock_price_lag_f15,Open_stock_price_lag_f20,High_stock_price_lag_f20,Low_stock_price_lag_f20,Close_stock_price_lag_f20
2000-02-02,0.019408,0.019408,0.019408,0.019408,-0.266344,0.019231,-0.01677,-0.01677,-0.01677,-0.01677,...,0.004684,0.004684,-0.024586,-0.024586,-0.024586,-0.024586,-0.055319,-0.055319,-0.055319,-0.055319


Date,Open,High,Low,Close,Open_stock_price_lag_f1,High_stock_price_lag_f1,Low_stock_price_lag_f1,Close_stock_price_lag_f1,Open_stock_price_lag_f2,High_stock_price_lag_f2,...,Close_stock_price_lag_f10,Open_stock_price_lag_f15,High_stock_price_lag_f15,Low_stock_price_lag_f15,Close_stock_price_lag_f15,Open_stock_price_lag_f20,High_stock_price_lag_f20,Low_stock_price_lag_f20,Close_stock_price_lag_f20,target_Close
2000-02-02,0.019408,0.019408,0.019408,0.019408,-0.01677,-0.01677,-0.01677,-0.01677,0.0,0.0,...,0.004684,-0.024586,-0.024586,-0.024586,-0.024586,-0.055319,-0.055319,-0.055319,-0.055319,0.019231
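With four price columns and seven lags each, the feature count and the number of leading rows lost to `NaN` follow from simple arithmetic. A small sketch, reusing the lag list and `delay` from the cell above (the other variable names are illustrative) and assuming the input frame itself has no missing values:

```python
lags = [1, 2, 3, 5, 10, 15, 20]
price_cols = ['Open', 'High', 'Low', 'Close']
delay = 1

# Current-day columns plus one lagged copy per (column, lag) pair.
n_features = len(price_cols) + len(price_cols) * len(lags)

# shift(lag) blanks the first `lag` rows and shift(delay) blanks `delay` more,
# so dropna() removes max(lags) + delay rows from the top of the frame.
rows_dropped = max(lags) + delay

print(n_features, rows_dropped)
```

The feature count is also the sequence length the LSTM receives once each row is reshaped to `(n_features, 1)`.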


In [28]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.metrics import RootMeanSquaredError
from tensorflow.keras.optimizers import Adam

from absl import logging
logging.set_verbosity(logging.ERROR)

# Split the data into train and test sets
train_val_data = df_feature_w_target[:-252*3]
test_data = df_feature_w_target[-252*3:]
display(test_data)

# Split the train data into train and validation sets
train_data = train_val_data[:-252]
valid_data = train_val_data[-252:]

# Normalize the data using MinMaxScaler
scaler3 = MinMaxScaler()
train_data3 = scaler3.fit_transform(train_data)
valid_data3 = scaler3.transform(valid_data)
test_data3 = scaler3.transform(test_data)

# make test data a dataframe
test_data_trans = pd.DataFrame(test_data3, columns=df_feature_w_target.columns)
display(test_data_trans)
test_data_inv = scaler3.inverse_transform(test_data_trans)
test_data_inv = pd.DataFrame(test_data_inv, columns=df_feature_w_target.columns)
display(test_data_inv)

# Split the data into features and targets
X_train3 = train_data3[:, :-1]
y_train3 = train_data3[:, -1]
X_valid3 = valid_data3[:, :-1]
y_valid3 = valid_data3[:, -1]
X_test3 = test_data3[:, :-1]
y_test3 = test_data3[:, -1]

# Reshape the data for LSTM
X_train3 = np.reshape(X_train3, (*(X_train3.shape), 1))
X_valid3 = np.reshape(X_valid3, (*(X_valid3.shape), 1))
X_test3 = np.reshape(X_test3, (*(X_test3.shape), 1))

# Define the model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train3.shape[1], 1)))
model.add(Dropout(0.3))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(units=50))
model.add(Dense(units=1, activation='linear'))
# model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss=MeanSquaredError(), optimizer=Adam(learning_rate=0.001), metrics=[RootMeanSquaredError()])
model.summary()

# Train the model
# cp3 = ModelCheckpoint('model1/', save_best_only=True)
history = model.fit(X_train3, y_train3, epochs=5, batch_size=32, validation_data=(X_valid3, y_valid3))#, callbacks=[cp3])

# from tensorflow.keras.models import load_model
# model3 = load_model('model3/')
model3 = model

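# Evaluate the model on the test set
# evaluate() returns [loss, metric]: the MSE loss first, then the RMSE metric
mse, rmse = model3.evaluate(X_test3, y_test3)
print(f'Test MSE: {mse:.4f}, Test RMSE: {rmse:.4f}\n') # earlier runs printed the RMSE under the "MSE" label: 0.1025, 0.1022, 0.1027, 0.1023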

# make predictions
y_pred3 = model3.predict(X_test3).flatten()

# fill test_data_trans with predictions



Date,Open,High,Low,Close,Open_stock_price_lag_f1,High_stock_price_lag_f1,Low_stock_price_lag_f1,Close_stock_price_lag_f1,Open_stock_price_lag_f2,High_stock_price_lag_f2,...,Close_stock_price_lag_f10,Open_stock_price_lag_f15,High_stock_price_lag_f15,Low_stock_price_lag_f15,Close_stock_price_lag_f15,Open_stock_price_lag_f20,High_stock_price_lag_f20,Low_stock_price_lag_f20,Close_stock_price_lag_f20,target_Close
2019-12-10,0.004975,-0.003596,-0.000332,-0.004613,0.017206,0.006581,0.020664,0.009983,0.012987,0.024612,...,-0.008339,-0.002337,-0.000996,-0.011483,-0.020067,-0.001664,0.012052,0.051246,0.040054,0.007613
2019-12-11,-0.006601,-0.001313,-0.007304,0.007613,0.004975,-0.003596,-0.000332,-0.004613,0.017206,0.006581,...,-0.018163,-0.006024,-0.008305,-0.006833,-0.007508,0.015000,-0.009012,0.013202,-0.028479,-0.001314
2019-12-12,0.013621,0.008870,0.012040,-0.001314,-0.006601,-0.001313,-0.007304,0.007613,0.004975,-0.003596,...,0.004796,-0.015152,-0.012060,-0.011696,-0.010317,-0.019705,-0.011043,-0.004678,0.014324,0.018750
2019-12-13,0.004261,0.011071,0.007270,0.018750,0.013621,0.008870,0.012040,-0.001314,-0.006601,-0.001313,...,0.006819,-0.010940,0.012208,0.005221,0.037179,0.013050,0.011932,0.014076,-0.007573,-0.031967
2019-12-16,0.008159,-0.002577,-0.020013,-0.031967,0.004261,0.011071,0.007270,0.018750,0.013621,0.008870,...,-0.012868,0.028690,0.016415,0.023545,0.004355,-0.002996,-0.015681,-0.013329,-0.003998,-0.019012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-21,0.013122,0.032430,0.026340,0.030831,0.006375,0.018091,0.006044,0.014966,0.045714,-0.011623,...,0.000781,0.040169,0.054175,0.039932,0.041872,-0.030705,-0.028700,-0.021603,-0.004315,0.021673
2022-12-22,0.049576,0.020654,0.028816,0.021673,0.013122,0.032430,0.026340,0.030831,0.006375,0.018091,...,-0.011310,0.040650,0.030139,0.041667,0.050433,0.000000,0.009390,0.022422,0.004715,0.017819
2022-12-23,0.017021,0.035413,0.037199,0.017819,0.049576,0.020654,0.028816,0.021673,0.013122,0.032430,...,-0.022485,0.038281,0.004876,0.002353,-0.040135,0.022609,0.041438,0.019737,0.034556,0.047103
2022-12-26,0.009623,0.026873,0.018143,0.047103,0.017021,0.035413,0.037199,0.017819,0.049576,0.020654,...,-0.002825,-0.033484,-0.012691,-0.009781,0.012505,0.031463,-0.012180,0.015054,-0.016082,-0.007166


,Open,High,Low,Close,Open_stock_price_lag_f1,High_stock_price_lag_f1,Low_stock_price_lag_f1,Close_stock_price_lag_f1,Open_stock_price_lag_f2,High_stock_price_lag_f2,...,Close_stock_price_lag_f10,Open_stock_price_lag_f15,High_stock_price_lag_f15,Low_stock_price_lag_f15,Close_stock_price_lag_f15,Open_stock_price_lag_f20,High_stock_price_lag_f20,Low_stock_price_lag_f20,Close_stock_price_lag_f20,target_Close
0,0.090883,0.094552,0.416389,0.475083,0.092114,0.095628,0.460775,0.520412,0.091689,0.097535,...,0.463511,0.090147,0.094827,0.392815,0.427089,0.090215,0.096207,0.525427,0.613798,0.513051
1,0.089718,0.094793,0.401648,0.513051,0.090883,0.094552,0.416389,0.475083,0.092114,0.095628,...,0.433001,0.089776,0.094054,0.402645,0.466090,0.091892,0.093979,0.445001,0.400965,0.485327
2,0.091753,0.095870,0.442544,0.485327,0.089718,0.094793,0.401648,0.513051,0.090883,0.094552,...,0.504302,0.088858,0.093656,0.392364,0.457369,0.088400,0.093764,0.407201,0.533891,0.547637
3,0.090811,0.096103,0.432460,0.547637,0.091753,0.095870,0.442544,0.485327,0.089718,0.094793,...,0.510584,0.089282,0.096223,0.428128,0.604868,0.091696,0.096194,0.446849,0.465890,0.390134
4,0.091204,0.094659,0.374780,0.390134,0.090811,0.096103,0.432460,0.547637,0.091753,0.095870,...,0.449445,0.093270,0.096668,0.466867,0.502932,0.090081,0.093273,0.388911,0.476993,0.430364
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
751,0.091703,0.098362,0.472775,0.585156,0.091024,0.096845,0.429867,0.535886,0.094983,0.093703,...,0.491832,0.094425,0.100662,0.501509,0.619443,0.087293,0.091897,0.371419,0.476008,0.556715
752,0.095371,0.097116,0.478009,0.556715,0.091703,0.098362,0.472775,0.585156,0.091024,0.096845,...,0.454283,0.094473,0.098120,0.505177,0.646032,0.090382,0.095925,0.464491,0.504051,0.544747
753,0.092095,0.098677,0.495732,0.544747,0.095371,0.097116,0.478009,0.556715,0.091703,0.098362,...,0.419579,0.094235,0.095448,0.422064,0.364766,0.092658,0.099315,0.458815,0.596724,0.635689
754,0.091351,0.097774,0.455446,0.635689,0.092095,0.098677,0.495732,0.544747,0.095371,0.097116,...,0.480635,0.087013,0.093590,0.396412,0.528242,0.093549,0.093644,0.448915,0.439463,0.467155


,Open,High,Low,Close,Open_stock_price_lag_f1,High_stock_price_lag_f1,Low_stock_price_lag_f1,Close_stock_price_lag_f1,Open_stock_price_lag_f2,High_stock_price_lag_f2,...,Close_stock_price_lag_f10,Open_stock_price_lag_f15,High_stock_price_lag_f15,Low_stock_price_lag_f15,Close_stock_price_lag_f15,Open_stock_price_lag_f20,High_stock_price_lag_f20,Low_stock_price_lag_f20,Close_stock_price_lag_f20,target_Close
0,0.004975,-0.003596,-0.000332,-0.004613,0.017206,0.006581,0.020664,0.009983,0.012987,0.024612,...,-0.008339,-0.002337,-0.000996,-0.011483,-0.020067,-0.001664,0.012052,0.051246,0.040054,0.007613
1,-0.006601,-0.001313,-0.007304,0.007613,0.004975,-0.003596,-0.000332,-0.004613,0.017206,0.006581,...,-0.018163,-0.006024,-0.008305,-0.006833,-0.007508,0.015000,-0.009012,0.013202,-0.028479,-0.001314
2,0.013621,0.008870,0.012040,-0.001314,-0.006601,-0.001313,-0.007304,0.007613,0.004975,-0.003596,...,0.004796,-0.015152,-0.012060,-0.011696,-0.010317,-0.019705,-0.011043,-0.004678,0.014324,0.018750
3,0.004261,0.011071,0.007270,0.018750,0.013621,0.008870,0.012040,-0.001314,-0.006601,-0.001313,...,0.006819,-0.010940,0.012208,0.005221,0.037179,0.013050,0.011932,0.014076,-0.007573,-0.031967
4,0.008159,-0.002577,-0.020013,-0.031967,0.004261,0.011071,0.007270,0.018750,0.013621,0.008870,...,-0.012868,0.028690,0.016415,0.023545,0.004355,-0.002996,-0.015681,-0.013329,-0.003998,-0.019012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
751,0.013122,0.032430,0.026340,0.030831,0.006375,0.018091,0.006044,0.014966,0.045714,-0.011623,...,0.000781,0.040169,0.054175,0.039932,0.041872,-0.030705,-0.028700,-0.021603,-0.004315,0.021673
752,0.049576,0.020654,0.028816,0.021673,0.013122,0.032430,0.026340,0.030831,0.006375,0.018091,...,-0.011310,0.040650,0.030139,0.041667,0.050433,0.000000,0.009390,0.022422,0.004715,0.017819
753,0.017021,0.035413,0.037199,0.017819,0.049576,0.020654,0.028816,0.021673,0.013122,0.032430,...,-0.022485,0.038281,0.004876,0.002353,-0.040135,0.022609,0.041438,0.019737,0.034556,0.047103
754,0.009623,0.026873,0.018143,0.047103,0.017021,0.035413,0.037199,0.017819,0.049576,0.020654,...,-0.002825,-0.033484,-0.012691,-0.009781,0.012505,0.031463,-0.012180,0.015054,-0.016082,-0.007166


Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_21 (LSTM)              (None, 32, 50)            10400     
                                                                 
 dropout_14 (Dropout)        (None, 32, 50)            0         
                                                                 
 lstm_22 (LSTM)              (None, 32, 50)            20200     
                                                                 
 dropout_15 (Dropout)        (None, 32, 50)            0         
                                                                 
 lstm_23 (LSTM)              (None, 50)                20200     
                                                                 
 dense_7 (Dense)             (None, 1)                 51        
                                                                 
Total params: 50,851
Trainable params: 50,851
Non-trai
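The training cell above ends with a note about filling the scaled test frame with predictions. One way to get predictions back into return units, sketched here under assumptions, is to invert the min-max scaling for the target column only. With scikit-learn's `MinMaxScaler`, the fitted per-column minima and maxima are available as `data_min_` and `data_max_`; the sketch uses plain numbers in their place, and the helper `inverse_target` and all values are hypothetical.

```python
import numpy as np


def inverse_target(y_scaled, col_min, col_max):
    """Undo min-max scaling to [0, 1] for a single column."""
    return np.asarray(y_scaled) * (col_max - col_min) + col_min


# Stand-ins for scaler3.data_min_[-1] and scaler3.data_max_[-1]
# (the target is the last column of the scaled frame).
target_min, target_max = -0.25, 0.25

y_pred_scaled = np.array([0.48, 0.50, 0.53])  # made-up model outputs
y_pred_returns = inverse_target(y_pred_scaled, target_min, target_max)
print(y_pred_returns)
```

In the notebook's terms this would be something like `inverse_target(y_pred3, scaler3.data_min_[-1], scaler3.data_max_[-1])`, which avoids building a full-width array just to call `inverse_transform`.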