This notebook sets up a time series forecasting task on the "close" column of the dataset: the most recent time steps are used to predict the value one time step ahead. The data is split into training and test sets, with 20% held out for testing. The split is chronological, so the sequences are not mixed.
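
As a rough illustration of "use the last time steps to predict one step forward", the sketch below turns a 1-D series into input windows and next-step targets. The helper name make_windows and the window length of 3 are assumptions made for this example only; they do not come from the notebook.

```python
import numpy as np

def make_windows(values, window=3):
    """Return (X, y): each row of X holds `window` consecutive values,
    and the matching entry of y is the value that follows them."""
    values = np.asarray(values, dtype=float)
    # One window per position that still has a following value to predict.
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y

# Toy series standing in for the close prices (oldest value first).
X, y = make_windows([1, 2, 3, 4, 5, 6], window=3)
print(X)
print(y)
```

The series must be longer than the window, otherwise there is nothing to stack. The windows also assume the values are ordered oldest first.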

Imports

In [67]:
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler




Load dataset

In [68]:
url = "https://gist.githubusercontent.com/Foxicution/9f9e25a45147b0ef5f262e565d78fe28/raw/25e7fe42b0f8368266f783c4920d540de8c26731/data.csv"
data = pd.read_csv(url)
# data = data[::-1]
display(data)
len(data)

Unnamed: 0,symbol,MSFT,MSFT.1,MSFT.2,MSFT.3,MSFT.4
0,,open,high,low,close,volume
1,"[2022-10-11 12:30:00-04:00, 2022-10-11 13:30:0...",228.38999938964844,229.05999755859375,227.38999938964844,227.60000610351562,2247906.0
2,"[2022-10-11 13:30:00-04:00, 2022-10-11 14:30:0...",227.60499572753906,228.5,226.86000061035156,227.60000610351562,2307918.0
3,"[2022-10-11 14:30:00-04:00, 2022-10-11 15:30:0...",227.58770751953125,228.02999877929688,224.11000061035156,225.05999755859375,4256199.0
4,"[2022-10-11 15:30:00-04:00, 2022-10-11 16:30:0...",225.08999633789062,225.97000122070312,224.49000549316406,225.44000244140625,3300724.0
...,...,...,...,...,...,...
2542,"[2024-03-25 14:30:00-04:00, 2024-03-25 15:30:0...",423.8500061035156,424.3949890136719,423.67999267578125,423.75,1065865.0
2543,"[2024-03-25 15:30:00-04:00, 2024-03-25 16:30:0...",423.739990234375,423.89990234375,422.32000732421875,422.8900146484375,2365026.0
2544,"[2024-03-26 09:30:00-04:00, 2024-03-26 10:30:0...",425.5050048828125,425.9800109863281,422.8999938964844,424.4346923828125,2623555.0
2545,"[2024-03-26 10:30:00-04:00, 2024-03-26 11:30:0...",424.4100036621094,424.7099914550781,423.67999267578125,424.0,1499059.0


2547

In [69]:
print(data.columns)
print(data.head())

Index(['symbol', 'MSFT', 'MSFT.1', 'MSFT.2', 'MSFT.3', 'MSFT.4'], dtype='object')
                                              symbol                MSFT  \
0                                                NaN                open   
1  [2022-10-11 12:30:00-04:00, 2022-10-11 13:30:0...  228.38999938964844   
2  [2022-10-11 13:30:00-04:00, 2022-10-11 14:30:0...  227.60499572753906   
3  [2022-10-11 14:30:00-04:00, 2022-10-11 15:30:0...  227.58770751953125   
4  [2022-10-11 15:30:00-04:00, 2022-10-11 16:30:0...  225.08999633789062   

               MSFT.1              MSFT.2              MSFT.3     MSFT.4  
0                high                 low               close     volume  
1  229.05999755859375  227.38999938964844  227.60000610351562  2247906.0  
2               228.5  226.86000061035156  227.60000610351562  2307918.0  
3  228.02999877929688  224.11000061035156  225.05999755859375  4256199.0  
4  225.97000122070312  224.49000549316406  225.44000244140625  3300724.0  


Replace the NaN in the header row with "date"

In [70]:
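# Row 0 holds the real field names; label its empty cell as 'date'.
# .loc avoids chained assignment, which may silently fail to write.
data.loc[0, 'symbol'] = 'date'

# Forward-fill any remaining gaps (fillna(method='ffill') is deprecated).
data['symbol'] = data['symbol'].ffill()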

print(data.head())

                                              symbol                MSFT  \
0                                               date                open   
1  [2022-10-11 12:30:00-04:00, 2022-10-11 13:30:0...  228.38999938964844   
2  [2022-10-11 13:30:00-04:00, 2022-10-11 14:30:0...  227.60499572753906   
3  [2022-10-11 14:30:00-04:00, 2022-10-11 15:30:0...  227.58770751953125   
4  [2022-10-11 15:30:00-04:00, 2022-10-11 16:30:0...  225.08999633789062   

               MSFT.1              MSFT.2              MSFT.3     MSFT.4  
0                high                 low               close     volume  
1  229.05999755859375  227.38999938964844  227.60000610351562  2247906.0  
2               228.5  226.86000061035156  227.60000610351562  2307918.0  
3  228.02999877929688  224.11000061035156  225.05999755859375  4256199.0  
4  225.97000122070312  224.49000549316406  225.44000244140625  3300724.0  




Check if normalization is needed


In [71]:
close_stats = data['MSFT.3'].describe()
print("Summary Statistics for Close Prices:")
print(close_stats)

Summary Statistics for Close Prices:
count                  2547
unique                 2395
top       328.6300048828125
freq                      4
Name: MSFT.3, dtype: object
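
The summary above lists count, unique, top and freq rather than a mean or a range because the column is stored as objects (strings), with the field name "close" sitting in row 0. A minimal sketch of one way to get numeric statistics follows. It uses a small hypothetical frame called sample in place of the real data, so the values shown are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the loaded frame: row 0 holds the field names,
# so every column is read as strings (object dtype).
sample = pd.DataFrame({
    "symbol": ["date", "t1", "t2", "t3"],
    "MSFT.3": ["close", "227.6", "225.06", "225.44"],
})

# Skip the embedded header row, then convert the remaining strings to floats.
close = pd.to_numeric(sample["MSFT.3"].iloc[1:], errors="coerce")

# With a float column, describe() reports mean, std, min, max and quartiles.
numeric_stats = close.describe()
print(numeric_stats)
```

On the real data, the min and max from such a summary show how wide the price range is, which is what the decision about scaling depends on.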


Prepare the dataset for time series forecasting

In [72]:
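# Reverse the row order and renumber the index from 0.
# After this step the newest rows come first and the header row comes last.
data = data.sort_index(ascending=False, axis=0).reset_index(drop=True)

# Keep only the timestamp and close columns under clearer names.
# Building the frame directly avoids the row-by-row chained assignment.
new_dataset = pd.DataFrame({"date": data["symbol"], "close": data["MSFT.3"]})

# Positional split: the first 987 rows and the remainder.
train_data = new_dataset.iloc[:987]
valid_data = new_dataset.iloc[987:]

# Use the date as the index and drop the now-redundant column.
new_dataset.index = new_dataset.date
new_dataset.drop("date", axis=1, inplace=True)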

print(new_dataset.tail(5))

                                                                 close
date                                                                  
[2022-10-11 15:30:00-04:00, 2022-10-11 16:30:00...  225.44000244140625
[2022-10-11 14:30:00-04:00, 2022-10-11 15:30:00...  225.05999755859375
[2022-10-11 13:30:00-04:00, 2022-10-11 14:30:00...  227.60000610351562
[2022-10-11 12:30:00-04:00, 2022-10-11 13:30:00...  227.60000610351562
date                                                             close


In [73]:
train_data.shape

(987, 2)

In [74]:
valid_data.shape

(1560, 2)
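
The shapes above give 987 training rows and 1560 validation rows, whereas the task description asks for 20% of the data to be held out for testing. One possible way to obtain such a split without shuffling is sketched below. The helper chronological_split and the toy frame are assumptions made for this example, and the rows are assumed to be ordered oldest first.

```python
import pandas as pd

def chronological_split(frame, test_fraction=0.2):
    """Split rows (oldest first) into a leading train part and a trailing
    test part, without shuffling."""
    n_test = int(round(len(frame) * test_fraction))
    split_at = len(frame) - n_test
    return frame.iloc[:split_at], frame.iloc[split_at:]

# Toy frame standing in for the close prices.
series = pd.DataFrame({"close": [float(v) for v in range(10)]})
train_part, test_part = chronological_split(series)
print(train_part.shape, test_part.shape)
```

Because the test rows are taken from the end, every training observation precedes every test observation in time.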

Scaling