---
<h1 align="center"> PatchTST Architecture</h1>

---

* PatchTST is a Transformer architecture for time series forecasting built on two key concepts: channel independence and patching.
* Channel independence decomposes a multichannel sequence into single channels before it is fed to the model, which allows channels holding different types of data to be handled by the same model.
* Patching divides the input sequence into smaller windows ("patches"), allowing the model to focus on groups of neighboring time steps.
* PatchTST is evaluated on multichannel time series forecasting tasks in which both the input and the output data are multichannel.
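
The two ideas above can be illustrated with a minimal pure-Python sketch. It is not the notebook's implementation: the helper names `split_channels` and `make_patches` are hypothetical, and the toy values are chosen only to make the windows easy to read.

```python
def split_channels(rows):
    """Channel independence: turn [time][channel] rows into one list per channel."""
    return [list(channel) for channel in zip(*rows)]


def make_patches(series, patch_len, stride):
    """Patching: slide a window of patch_len over one channel with the given stride."""
    last_start = len(series) - patch_len
    return [series[start:start + patch_len]
            for start in range(0, last_start + 1, stride)]


# Toy input: 8 time steps, 2 channels (illustrative values only).
rows = [[t, 10 * t] for t in range(8)]
channels = split_channels(rows)

# Each channel is patched on its own, as a univariate series.
patches = make_patches(channels[0], patch_len=4, stride=2)
print(patches)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
```

With a stride smaller than the patch length, consecutive patches overlap, which is the setting used in this notebook (`stride` < `patch_len`).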

<img width="1000" src="Images/PatchTST.png" alt="PatchTST Architecture" align="center" />

---
## Inputs and Outputs
---
##### **Input sequence:**
* The input to PatchTST is a time series that can have multiple channels (features). The multichannel input is first decomposed into single channels (channel independence); the model then divides each channel into smaller patches, which lets it focus on local patterns while reducing memory use and speeding up inference.

* The length of the input sequence is defined by the `seq_len` hyperparameter.


##### **Output sequence:**
* The output of the PatchTST model is a multichannel sequence of predicted values, where each value is the prediction for one channel at a specific time step of the output sequence.

* The length of the output sequence is defined by the `pred_len` hyperparameter.
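
As a rough sketch of how `seq_len` and `pred_len` cut one training sample out of a series, the hypothetical helper `make_window` below returns the input slice and the target slice that follows it. For multichannel data each element would be a row of channel values rather than a single number.

```python
def make_window(series, start, seq_len, pred_len):
    """Return (input, target) slices for the window that begins at `start`."""
    x_end = start + seq_len
    y_end = x_end + pred_len
    if y_end > len(series):
        raise ValueError("window runs past the end of the series")
    return series[start:x_end], series[x_end:y_end]


series = list(range(10))
x, y = make_window(series, start=2, seq_len=4, pred_len=3)
print(x, y)  # [2, 3, 4, 5] [6, 7, 8]
```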

---
## Methodology
---

- This notebook provides a step-by-step guide to replicating the **PatchTST** model and training it on the ETDataset (ETTh1, ETTh2, ETTm1, and ETTm2), checking the reproduction by comparing the notebook results with those reported in the official paper. The main focus is to study the *impact of the prediction length and the number of input patches on performance*. The workflow, from data preparation to forecasting, is as follows:


### 1. Data Preparation: 
* The ETT dataset is preprocessed by normalizing the input features and splitting the data into training, validation, and test sets.
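
A minimal sketch of this step, under two assumptions that the text does not state: the split is chronological, and the normalization is a z-score whose mean and standard deviation are computed on the training split only. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def chronological_split(values, n_train, n_val):
    """Split a series into train/val/test blocks without shuffling."""
    train = values[:n_train]
    val = values[n_train:n_train + n_val]
    test = values[n_train + n_val:]
    return train, val, test


def standardize(values, mean, std):
    """Apply z-score normalization with precomputed statistics."""
    return [(v - mean) / std for v in values]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train, val, test = chronological_split(values, n_train=4, n_val=2)

# Statistics come from the training block only (assumption), so no
# information from the validation or test periods leaks into the scaling.
mean, std = fmean(train), pstdev(train)
train_scaled = standardize(train, mean, std)
val_scaled = standardize(val, mean, std)
test_scaled = standardize(test, mean, std)
```

After scaling, the training block has zero mean and unit variance, while the later blocks generally do not.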

### 2. Replicate Model Architecture: 
* Use the PatchTST architecture with channel independence and patching, then reconstruct the PatchTST model to output a single-channel sequence instead of a multichannel sequence.

### 3. Training: 
* Train the PatchTST model on the training set and validate it on the validation set. We used the Mean Absolute Percentage Error (MAPE) as the evaluation metric.
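
Validation during training is also what drives early stopping, which shows up in the training logs as lines such as `EarlyStopping counter: 1 out of 10`. The class below is an independent sketch of that idea, not the repository's implementation; the name `EarlyStopper` and the example loss values are hypothetical.

```python
class EarlyStopper:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss   # improvement: remember it and reset the counter
            self.counter = 0
            return False
        self.counter += 1          # no improvement this epoch
        return self.counter >= self.patience


stopper = EarlyStopper(patience=2)
decisions = [stopper.step(loss) for loss in [0.80, 0.69, 0.79, 0.75]]
print(decisions)  # [False, False, False, True]
```

In a real training loop the model checkpoint would be saved whenever the loss improves, so that the best weights can be restored after stopping.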

### 4. Hyperparameter Tuning: 
* Tune the hyperparameters of the PatchTST model using a grid search approach. We experimented with different values for the number of layers, the number of heads, the patch length, and the stride.
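
A grid search over these four hyperparameters can be sketched with `itertools.product`. The candidate values below are placeholders chosen for illustration, not the values actually searched in this notebook.

```python
from itertools import product

# Hypothetical search space; replace with the values to be explored.
search_space = {
    "e_layers": [2, 3],
    "n_heads": [4, 16],
    "patch_len": [16, 24],
    "stride": [8],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for combo in product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


configs = list(iter_configs(search_space))
print(len(configs))  # 8
print(configs[0])    # {'e_layers': 2, 'n_heads': 4, 'patch_len': 16, 'stride': 8}
```

Each configuration would then be copied into `args`, trained, and scored on the validation set, keeping the combination with the lowest validation loss.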

### 5. Testing: 
* Evaluate the performance of the PatchTST model on the test set and compare it with the performance of simple DNN models.

### 6. Results: 
* We will apply the PatchTST model to the ETT datasets with different numbers of patches and prediction lengths, following the trials in the original PatchTST paper, and compare the results.


---
## Hyperparameters
---

**Here are the hyperparameters that control the input and output of PatchTST:**

| Parameter | Description | Value |
|---|---|---|
| `args.data` | The name of the dataset to use. | 'ETTh1', 'ETTh2', 'ETTm1', and 'ETTm2'|
| `args.root_path` | The root path of the data file. | `./Datasets/` |
| `args.features` | The type of forecasting task to perform. The options are 'M' (multivariate predict multivariate), 'S' (univariate predict univariate), and 'MS' (multivariate predict univariate). | 'M'|
| `args.target` | The target feature to predict in a univariate or multivariate task. ('HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT')| 'OT' |
| `args.seq_len` | The length of the input sequence to the PatchTST encoder. | 336|
| `args.pred_len` | The length of the future sequence to be predicted. | [96, 192, 336, 720]|
| `args.patch_len` | The size of each patch | 24 |
| `args.num_patch` | The number of input patches| [42, 64]  |
| `args.stride` | The step size used to slide the patch window across the input time series data during the patching process | 8|
| `args.batch_size` |  The size of one batch in training | 128 |
| `args.learning_rate` | Learning rate  | 0.0001 |
| `args.patience` | The number of epochs to wait before early stopping | 10 |
| `args.train_epochs` | Number of epochs in train | 20 |
| `args.padding` | The amount of padding to add to the input sequence, if any. | 0|
| `args.freq` |  freq for time features encoding, options:[s:secondly, t:minutely, h:hourly, d:daily, b:business days, w:weekly, m:monthly], you can also use more detailed freq like 15min or 3h | 'h' |
| `args.embed` | The type of time feature encoding to use. The options are 'timeF' (time features encoding), 'fixed' (fixed positional encoding), and 'learned' (learned positional encoding). | 'timeF' |
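
The parameters `seq_len`, `patch_len`, `stride`, and `num_patch` are related. The sketch below uses the patch-count relation as we read it from the PatchTST paper: one patch per stride step, plus one extra patch when the end of the sequence is padded. The function name and the `pad_end` flag are hypothetical.

```python
def num_patches(seq_len, patch_len, stride, pad_end=False):
    """Number of patches obtained by sliding patch_len over seq_len with the given stride."""
    n = (seq_len - patch_len) // stride + 1
    # Padding the end of the sequence by one stride yields one additional patch.
    return n + 1 if pad_end else n


print(num_patches(336, 24, 8))                # 40
print(num_patches(336, 16, 8, pad_end=True))  # 42
print(num_patches(512, 16, 8, pad_end=True))  # 64
```

Under this relation, `patch_len=24` with `stride=8` on a 336-step input gives 40 patches, whereas 42 and 64 correspond to `patch_len=16` with end padding on inputs of 336 and 512 steps. It is worth checking which combination the experiment code actually uses.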

---------
- **ProbSparse Attention:** 

| Parameter | Description | Value |
|---|---|---|
| `args.attn` | The type of attention used in the encoder. The options are 'prob' (probabilistic sparse attention) and 'full' (full attention). | 'prob' |
| `args.n_heads` | The number of attention heads in the encoder. | 16 |
| `args.factor` | The ProbSparse attention factor. A higher value of factor results in a sparser attention matrix. | 5 |
| `args.dropout` | The dropout probability applied to the attention weights. | 0.1 |
| `args.d_model` | The dimension of the model. | 512 |

---
- **Feedforward Network:** 

| Parameter | Description | Value |
|---|---|---|
| `args.d_ff` | The dimension of the feedforward network. | 2048 |
| `args.activation` | The activation function used in the feedforward network. | 'gelu' |
| `args.dropout` | The dropout probability applied to the attention weights. | 0.1 |

---
- **Mixing Layer:**

The model includes a mixing layer that linearly combines the outputs of the attention heads in the encoder and decoder, which helps to improve the model's performance. Here are the hyperparameters that control the mixing layer:

| Parameter | Description | Value |
|---|---|---|
| `args.mix` | Whether to apply a linear projection to the concatenated outputs of the attention heads. | True |
| `args.d_model` | The dimension of the model. | 512 |

---
**The following are the experiment Hyperparameters.**


| Parameter | Description | Value |
|---|---|---|
| `args.output_attention` | Whether to output attention in the encoder | False |
| `args.use_amp` | Whether to use automatic mixed precision training | False |
| `args.train_only` | Whether to train the model or fine-tune | True |
| `args.train_epochs` | The number of epochs to train for. | 8 |
| `args.batch_size` | The batch size of training input data. | 32 |
| `args.learning_rate` | Initial learning rate; it starts at 1e−4 and is halved every epoch. | 0.0001 |
| `args.lradj` | Learning rate adjustment schedule; 'type1' halves the learning rate every epoch. | 'type1' |
| `args.loss` | Evaluating criteria | `'mse'` |
| `args.patience` | The number of epochs to wait before early stopping. | 3 |
| `args.des` | The description of the experiment. | 'test' |
| `args.itr` | The iteration of the experiment. | 1 |
| `args.model` | The model name | 'PatchTST' |
| `args.checkpoints` | Location of model checkpoints | `'./PatchTST_checkpoints'` |
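
The `'type1'` schedule can be written as a one-line function. This is a sketch of the halving rule described in the table, with a hypothetical function name; it is not taken from the repository.

```python
def type1_lr(base_lr, epoch):
    """Learning rate after `epoch` (1-based) when the rate is halved every epoch."""
    return base_lr * 0.5 ** (epoch - 1)


# With the base rate of 0.005 used in the trials of this notebook:
for epoch in (1, 2, 3):
    print(epoch, type1_lr(0.005, epoch))
```

For a base rate of 0.005 this gives 0.005, 0.0025, and 0.00125 for the first three epochs, which matches the `Updating learning rate to ...` lines in the training logs.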


---
# Setup
---

**Add project_files to system path**

In [None]:
import os
os.chdir('./Transformer-based-solutions-for-the-long-term-time-series-forecasting')

In [None]:
import sys
if 'Transformer-based-solutions-for-the-long-term-time-series-forecasting' not in sys.path:
    sys.path += ['Transformer-based-solutions-for-the-long-term-time-series-forecasting']
    
sys.path

**Import libraries**

In [3]:
import torch
from torch.utils.data import DataLoader
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
import os
from exp.exp_PatchTST import Exp_Main #, Dataset_Pred

In [4]:
class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
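
One consequence of mapping `__getattr__` to `dict.get` is that reading a key that has not been set returns `None` instead of raising an error. The short demonstration below repeats the class definition so that it runs on its own; the variable names `cfg` and `label` are illustrative.

```python
class dotdict(dict):
    """Same definition as above, repeated so the snippet is self-contained."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


cfg = dotdict()
print(cfg.seq_len)  # None: a missing key does not raise AttributeError

# A string built before the keys are assigned silently contains 'None'.
label = f"{cfg.data}_{cfg.seq_len}"
cfg.data, cfg.seq_len = "ETTh1", 336
print(label)                        # None_None
print(f"{cfg.data}_{cfg.seq_len}")  # ETTh1_336
```

Derived values such as identifiers should therefore be built only after every key they depend on has been assigned.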

---
# Working on ETT Dataset


* The Electricity Transformer Temperature (ETT) is a crucial indicator in long-term electric power deployment. This dataset consists of 2 years of data from two separate counties. There are different subsets: {ETTh1, ETTh2} at the 1-hour level and ETTm1 at the 15-minute level. Each data point consists of the target value "oil temperature" and 6 power load features. The train/val/test split is 12/4/4 months.
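
The 12/4/4-month split determines how many sliding windows each subset yields. The sketch below rests on two assumptions: a month is counted as 30 days of hourly rows, and the validation block is extended backwards by `seq_len` rows so that its first window has a full look-back. Under these assumptions it reproduces the `train` and `val` sample counts printed in the ETTh1 training logs of this notebook.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months, hourly sampling


def num_windows(split_len, seq_len, pred_len):
    """Number of (input, target) windows that fit into a block of split_len rows."""
    return split_len - seq_len - pred_len + 1


seq_len = 336
train_len = 12 * HOURS_PER_MONTH          # 8640 rows
val_len = 4 * HOURS_PER_MONTH + seq_len   # 2880 rows plus look-back overlap (assumption)

counts = {
    pred_len: (num_windows(train_len, seq_len, pred_len),
               num_windows(val_len, seq_len, pred_len))
    for pred_len in (96, 192, 336, 720)
}
print(counts)
# {96: (8209, 2785), 192: (8113, 2689), 336: (7969, 2545), 720: (7585, 2161)}
```

Longer prediction lengths leave fewer windows, which is why the number of steps per epoch shrinks as `pred_len` grows.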

---
# Working on ETTh1 Dataset
---

## Trial 1: PatchTST/42, Dataset: ETTh1, Prediction Length: 96
### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.


In [65]:
# dotdict converts a dictionary into an object
# whose keys can be accessed as attributes.

args = dotdict()

args.model = 'PatchTST'   # Model Name
args.random_seed = 2021
args.is_training = 1
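# NOTE: args.data, args.seq_len and args.pred_len are only assigned further down,
# so at this point the f-string evaluates to 'None_None_None'
# (dotdict returns None for missing keys).
args.model_id = f"{args.data}_{args.seq_len}_{args.pred_len}"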
args.fc_dropout = 0.3
args.head_dropout = 0
args.patch_len = 24 # The size of each patch
args.num_patch = 42  # The number of input patches (42 64)  
args.stride = 8 # The step size used to slide the patch window across the input time series data during the patching process.
args.batch_size = 32 # The size of one batch in training
args.learning_rate = 0.0001 # Learning rate (overridden by the 0.005 assignment further down)
args.pred_len = 96 # prediction sequence length   (96, 192, 336, 720)  
args.patience = 10  # The number of epochs to wait before early stopping.
args.train_epochs = 20   # Number of epochs in train

args.use_multi_gpu = False 
args.use_gpu = True if torch.cuda.is_available() else False # Using GPU if cuda is available 
args.learning_rate = 0.005 
args.label_len = 48 # start token length of PatchTST decoder
args.use_amp = False # whether to use automatic mixed precision training
args.output_attention = False # whether to output attention in the encoder
args.features = 'M' # forecasting task, options:[M, S, MS]; M:multivariate predict multivariate, S:univariate predict univariate, MS:multivariate predict univariate
args.train_only=True
args.checkpoints = './PatchTST_checkpoints' # location of model checkpoints


args.data = 'ETTh1'  # data
args.root_path = './ETDataset/' # root path of data file
args.data_path = 'ETTh1.csv' # data file
args.target = 'OT' # target feature in S or MS task
args.freq = 'h' # freq for time features encoding, options:[s:secondly, t:minutely, h:hourly, d:daily, b:business days, w:weekly, m:monthly], you can also use more detailed freq like 15min or 3h
args.seq_len = 336 # input sequence length of PatchTST encoder

# PatchTST decoder input: concat[start token series(label_len), zero padding series(pred_len)]
args.enc_in = 7 # encoder input size
args.dec_in = 7 # decoder input size
args.c_out = 7 # output size
args.factor = 5 # probsparse attn factor
args.d_model = 16 # dimension of model
args.n_heads = 4 # num of heads
args.e_layers = 3 # num of encoder layers
args.d_layers = 1 # num of decoder layers
args.d_ff = 128   # dimension of fcn in model
args.dropout = 0.3 # 0.05 # dropout
args.attn = 'prob' # attention used in encoder, options:[prob, full]
args.embed = 'timeF' # time features encoding, options:[timeF, fixed, learned]
args.activation = 'gelu' # activation
args.distil = True # whether to use distilling in encoder
args.mix = True
args.padding = 0
#args.freq = 'h'   # # freq for time features encoding, options:[s:secondly, t:minutely, h:hourly, d:daily, b:business days, w:weekly, m:monthly], you can also use more detailed freq like 15min or 3h
args.loss = 'mse'   # evaluating criteria
args.lradj = 'type1'  # learning rate decayed two times smaller every epoch.
args.num_workers = 0
args.itr = 1
args.des = "Exp"   # The description of the experiment.
args.gpu = 0
args.devices = '0,1,2,3'

In [66]:
args.use_gpu = True if torch.cuda.is_available() and args.use_gpu else False
if args.use_gpu and args.use_multi_gpu:
    args.devices = args.devices.replace(' ','')
    device_ids = args.devices.split(',')
    args.device_ids = [int(id_) for id_ in device_ids]
    args.gpu = args.device_ids[0]
    
print("Hyperparameter Combination of Trial 1: ") 
print(args)

Hyperparameter Combination of Trial 1: 
{'model': 'PatchTST', 'random_seed': 2021, 'is_training': 1, 'model_id': 'None_None_None', 'fc_dropout': 0.3, 'head_dropout': 0, 'patch_len': 24, 'num_patch': 42, 'stride': 8, 'batch_size': 32, 'learning_rate': 0.005, 'pred_len': 96, 'patience': 10, 'train_epochs': 20, 'use_multi_gpu': False, 'use_gpu': True, 'label_len': 48, 'use_amp': False, 'output_attention': False, 'features': 'M', 'train_only': True, 'checkpoints': './PatchTST_checkpoints', 'data': 'ETTh1', 'root_path': './ETDataset/', 'data_path': 'ETTh1.csv', 'target': 'OT', 'freq': 'h', 'seq_len': 336, 'enc_in': 7, 'dec_in': 7, 'c_out': 7, 'factor': 5, 'd_model': 16, 'n_heads': 4, 'e_layers': 3, 'd_layers': 1, 'd_ff': 128, 'dropout': 0.3, 'attn': 'prob', 'embed': 'timeF', 'activation': 'gelu', 'distil': True, 'mix': True, 'padding': 0, 'loss': 'mse', 'lradj': 'type1', 'num_workers': 0, 'itr': 1, 'des': 'Exp', 'gpu': 0, 'devices': '0,1,2,3'}


### Training

In [67]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8209
val 2785
test 2785
	iters: 100, epoch: 1 | loss: 0.5021333
	speed: 0.1957s/iter; left time: 982.6902s
	iters: 200, epoch: 1 | loss: 0.4459157
	speed: 0.1831s/iter; left time: 900.9876s
Epoch: 1 cost time: 49.07178854942322
Epoch: 1, Steps: 256 | Train Loss: 0.5326953 Vali Loss: 0.7148266 Test Loss: 0.4131814
Validation loss decreased (inf --> 0.714827).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.3932944
	speed: 0.7549s/iter; left time: 3597.2838s
	iters: 200, epoch: 2 | loss: 0.3659179
	speed: 0.2130s/iter; left time: 993.4122s
Epoch: 2 cost time: 51.905097007751465
Epoch: 2, Steps: 256 | Train Loss: 0.4019561 Vali Loss: 0.7034194 Test Loss: 0.3983363
Validation loss decreased (0.714827 --> 0.703419).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3327482
	speed: 0.7451s/iter; left time: 3359.5105s
	iters: 200, epoch: 3 | loss: 0.3363656
	speed: 0.2110s/iter; left time: 930.1154s
Epoch

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [68]:
exp.test(setting)
torch.cuda.empty_cache()

test 2785
mse:0.3890248239040375, mae:0.40697768330574036, rse:0.5924373865127563
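
The three reported numbers can be read as follows. The definitions below are a sketch written for this explanation: MSE and MAE are standard, while the RSE formula (root of the squared error relative to the root of the squared deviation of the targets from their mean) is our assumption about what the experiment code computes.

```python
import math


def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rse(pred, true):
    """Root relative squared error (assumed definition, see the note above)."""
    mean_true = sum(true) / len(true)
    num = math.sqrt(sum((t - p) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum((t - mean_true) ** 2 for t in true))
    return num / den


# Toy values for illustration only.
true = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(mse(pred, true), mae(pred, true), round(rse(pred, true), 4))
# 0.125 0.25 0.3162
```

Under this definition, an RSE below 1 means the forecast has a smaller squared error than always predicting the mean of the targets.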


---
## Trial 2: PatchTST/64, Dataset: ETTh1, Prediction Length: 96
### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.

In [24]:
args.pred_len = 96 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 96


### Training

In [25]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8209
val 2785
test 2785
	iters: 100, epoch: 1 | loss: 0.5467896
	speed: 0.4824s/iter; left time: 1833.4442s
Epoch: 1 cost time: 91.73658847808838
Epoch: 1, Steps: 195 | Train Loss: 0.5800090 Vali Loss: 0.8040251 Test Loss: 0.4371468
Validation loss decreased (inf --> 0.804025).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.3973847
	speed: 1.6531s/iter; left time: 5961.1725s
Epoch: 2 cost time: 103.09685468673706
Epoch: 2, Steps: 195 | Train Loss: 0.4063360 Vali Loss: 0.6899766 Test Loss: 0.4076010
Validation loss decreased (0.804025 --> 0.689977).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3191050
	speed: 1.7549s/iter; left time: 5985.8728s
Epoch: 3 cost time: 106.60452842712402
Epoch: 3, Steps: 195 | Train Loss: 0.3574004 Vali Loss: 0.7909207 Test Loss: 0.4054973
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.3109620
	speed: 1.7951s/it

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [26]:
exp.test(setting)
torch.cuda.empty_cache()

test 2785
mse:0.40297746658325195, mae:0.41574206948280334, rse:0.6028541922569275


---
## Trial 3: PatchTST/42, Dataset: ETTh1, Prediction Length: 192
### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.

In [27]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 192


### Training

In [28]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8113
val 2689
test 2689
	iters: 100, epoch: 1 | loss: 0.4995359
	speed: 0.7013s/iter; left time: 2637.6064s
Epoch: 1 cost time: 132.8545961380005
Epoch: 1, Steps: 193 | Train Loss: 0.6192046 Vali Loss: 0.9940767 Test Loss: 0.4705402
Validation loss decreased (inf --> 0.994077).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4622335
	speed: 3.4180s/iter; left time: 12195.3505s
Epoch: 2 cost time: 132.79999923706055
Epoch: 2, Steps: 193 | Train Loss: 0.4497860 Vali Loss: 0.9738395 Test Loss: 0.4425929
Validation loss decreased (0.994077 --> 0.973839).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3778394
	speed: 3.4640s/iter; left time: 11690.9593s
Epoch: 3 cost time: 134.50500321388245
Epoch: 3, Steps: 193 | Train Loss: 0.3972456 Vali Loss: 1.0681245 Test Loss: 0.4415466
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.3657199
	speed: 3.4542s/

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [29]:
exp.test(setting)
torch.cuda.empty_cache()

test 2689
mse:0.4425927400588989, mae:0.44522473216056824, rse:0.6317625045776367


## Trial 4: PatchTST/64, Dataset: ETTh1, Prediction Length: 192
### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.


In [30]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 192


### Training

In [31]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8113
val 2689
test 2689
	iters: 100, epoch: 1 | loss: 0.5080305
	speed: 0.7014s/iter; left time: 2637.9586s
Epoch: 1 cost time: 138.65479850769043
Epoch: 1, Steps: 193 | Train Loss: 0.5953040 Vali Loss: 0.9777954 Test Loss: 0.4816293
Validation loss decreased (inf --> 0.977795).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4542234
	speed: 3.5243s/iter; left time: 12574.5809s
Epoch: 2 cost time: 138.20450687408447
Epoch: 2, Steps: 193 | Train Loss: 0.4473222 Vali Loss: 0.9944211 Test Loss: 0.4389979
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4092036
	speed: 3.5358s/iter; left time: 11933.3309s
Epoch: 3 cost time: 138.68963861465454
Epoch: 3, Steps: 193 | Train Loss: 0.3981723 Vali Loss: 1.0615439 Test Loss: 0.4434692
EarlyStopping counter: 2 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.3955112
	speed: 3.5299s/iter; left time: 11232.2908s
Epoc

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [32]:
exp.test(setting)
torch.cuda.empty_cache()

test 2689
mse:0.4816294014453888, mae:0.4722605049610138, rse:0.6590345501899719


---
## Trial 5: PatchTST/42, Dataset: ETTh1, Prediction Length: 336

### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.

In [33]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 336


### Training

In [34]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.5583267
	speed: 0.5613s/iter; left time: 2066.2315s
Epoch: 1 cost time: 108.53734970092773
Epoch: 1, Steps: 189 | Train Loss: 0.6319610 Vali Loss: 1.1278813 Test Loss: 0.5054743
Validation loss decreased (inf --> 1.127881).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4190464
	speed: 2.9080s/iter; left time: 10154.8707s
Epoch: 2 cost time: 116.59747886657715
Epoch: 2, Steps: 189 | Train Loss: 0.4895814 Vali Loss: 1.1570295 Test Loss: 0.4693138
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4272126
	speed: 2.9900s/iter; left time: 9875.9737s
Epoch: 3 cost time: 114.30476760864258
Epoch: 3, Steps: 189 | Train Loss: 0.4388608 Vali Loss: 1.3474638 Test Loss: 0.4732141
EarlyStopping counter: 2 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.4151227
	speed: 2.7710s/iter; left time: 8628.8553s
Epoch:

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [35]:
exp.test(setting)
torch.cuda.empty_cache()

test 2545
mse:0.5054745078086853, mae:0.4913105070590973, rse:0.6770716905593872


---
## Trial 6: PatchTST/64, Dataset: ETTh1, Prediction Length: 336

### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.

In [36]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 336


### Training

In [37]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.5901074
	speed: 0.4463s/iter; left time: 1642.9432s
Epoch: 1 cost time: 79.73448538780212
Epoch: 1, Steps: 189 | Train Loss: 0.6334951 Vali Loss: 1.1467382 Test Loss: 0.5077454
Validation loss decreased (inf --> 1.146738).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4579918
	speed: 1.9870s/iter; left time: 6938.4719s
Epoch: 2 cost time: 88.40225148200989
Epoch: 2, Steps: 189 | Train Loss: 0.4895072 Vali Loss: 1.1918480 Test Loss: 0.4653360
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4123957
	speed: 2.0161s/iter; left time: 6659.2653s
Epoch: 3 cost time: 75.4123055934906
Epoch: 3, Steps: 189 | Train Loss: 0.4425045 Vali Loss: 1.1529659 Test Loss: 0.4635212
EarlyStopping counter: 2 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.5498210
	speed: 1.8259s/iter; left time: 5685.7970s
Epoch: 4 co

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [38]:
exp.test(setting)
torch.cuda.empty_cache()

test 2545
mse:0.507745623588562, mae:0.4910207688808441, rse:0.6785910725593567


---
## Trial 7: PatchTST/42, Dataset: ETTh1, Prediction Length: 720

### Set hyperparameters
Set the experiment parameters (`args`) in a dictionary-like object.


In [39]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1,  Prediction Length : 720


### Training

In [40]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7585
val 2161
test 2161
	iters: 100, epoch: 1 | loss: 0.6874392
	speed: 0.5746s/iter; left time: 2011.7808s
Epoch: 1 cost time: 102.16214656829834
Epoch: 1, Steps: 180 | Train Loss: 0.6947509 Vali Loss: 1.2598931 Test Loss: 0.5196058
Validation loss decreased (inf --> 1.259893).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.5145122
	speed: 2.2390s/iter; left time: 7435.7632s
Epoch: 2 cost time: 97.10556840896606
Epoch: 2, Steps: 180 | Train Loss: 0.5511002 Vali Loss: 1.5157858 Test Loss: 0.6210873
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4539023
	speed: 2.1410s/iter; left time: 6724.7364s
Epoch: 3 cost time: 97.00365948677063
Epoch: 3, Steps: 180 | Train Loss: 0.4978741 Vali Loss: 1.3462009 Test Loss: 0.5808331
EarlyStopping counter: 2 out of 10
Updating learning rate to 0.00125
	iters: 100, epoch: 4 | loss: 0.4298618
	speed: 2.3790s/iter; left time: 7044.2243s
Epoch: 4 

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=48, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)
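
The layer shapes in the printout above are enough to estimate how small this configuration is. The sketch below counts the weights and biases of the `Linear` layers that are visible in the truncated printout: the patch projection `W_P` and the four attention projections of one encoder layer. `linear_params` is a hypothetical helper introduced only for this illustration, and layers cut off by the truncation are not counted.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Shapes copied from the printout above.
w_p = linear_params(48, 16)            # patch projection W_P
attn_proj = 4 * linear_params(16, 16)  # W_Q, W_K, W_V and to_out[0]

print(f"W_P: {w_p} parameters")
print(f"Attention projections, one encoder layer: {attn_proj} parameters")
print(f"Attention projections, three encoder layers: {3 * attn_proj} parameters")
```

The input width of `W_P` equals the patch length, so a change in `in_features` between trials reflects a different patch length at run time.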

### Testing

In [41]:
exp.test(setting)
torch.cuda.empty_cache()

test 2161
mse:0.5196058750152588, mae:0.5248337984085083, rse:0.6904626488685608
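
For reference, the three numbers reported by `exp.test` can be reproduced on toy data. This is a minimal sketch, assuming the usual definitions: MSE and MAE are averages over all predicted points, and RSE is the root of the squared error divided by the root of the squared deviation of the targets from their mean. The exact implementation inside the experiment code is not shown in this notebook, so treat the RSE formula as an assumption.

```python
import math


def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rse(pred, true):
    """Root relative squared error (assumed definition, see lead-in)."""
    mean_true = sum(true) / len(true)
    num = math.sqrt(sum((t - p) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum((t - mean_true) ** 2 for t in true))
    return num / den


# Toy values, not taken from the experiment.
true = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
print(f"mse:{mse(pred, true)}, mae:{mae(pred, true)}, rse:{rse(pred, true)}")
```

Under this definition, an RSE below 1 means the forecast beats a constant prediction equal to the mean of the targets.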


---
## Trial 8: PatchTST/64, Dataset: ETTh1, Prediction Length: 336

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [6]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1, Prediction Length: 336, Number of Patches: 64
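
The names PatchTST/42 and PatchTST/64 refer to the number of input patches. In the PatchTST paper this number follows from the look-back window `L`, the patch length `P`, and the stride `S` as `N = floor((L - P) / S) + 2`, where the extra patch comes from padding the end of the sequence. The sketch below evaluates that formula with the paper's settings (`P = 16`, `S = 8`, `L = 336` or `512`). `num_patches` is a hypothetical helper, and whether `args.num_patch` is consumed this way by the experiment code is not shown in this notebook.

```python
def num_patches(seq_len: int, patch_len: int, stride: int, pad_end: bool = True) -> int:
    """Number of patches cut from a sequence of length seq_len."""
    n = (seq_len - patch_len) // stride + 1
    # Padding the end of the sequence by one stride adds one more patch.
    return n + 1 if pad_end else n


for seq_len in (336, 512):
    print(f"seq_len={seq_len}: {num_patches(seq_len, patch_len=16, stride=8)} patches")
```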


### Training

In [7]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.6037611
	speed: 0.5880s/iter; left time: 2869.9431s
	iters: 200, epoch: 1 | loss: 0.5447749
	speed: 0.5710s/iter; left time: 2730.1475s
Epoch: 1 cost time: 143.60614132881165
Epoch: 1, Steps: 249 | Train Loss: 0.6057979 Vali Loss: 1.1328398 Test Loss: 0.4814026
Validation loss decreased (inf --> 1.132840).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.5242595
	speed: 2.8460s/iter; left time: 13182.7123s
	iters: 200, epoch: 2 | loss: 0.4272539
	speed: 0.5520s/iter; left time: 2501.6329s
Epoch: 2 cost time: 137.59608268737793
Epoch: 2, Steps: 249 | Train Loss: 0.4929329 Vali Loss: 1.2551457 Test Loss: 0.4989186
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3851923
	speed: 2.7419s/iter; left time: 12017.9617s
	iters: 200, epoch: 3 | loss: 0.3962872
	speed: 0.5690s/iter; left time: 2437.0923s
Epoch: 3 cost time: 141.91222023

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [8]:
exp.test(setting)
torch.cuda.empty_cache()

test 2545
mse:0.4814022183418274, mae:0.4737595319747925, rse:0.6606630086898804
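
Copying metrics into a results table by hand is error-prone. As one possible approach, the sketch below parses a metrics line of the form printed above and formats it as a Markdown table row. `parse_metrics` and the row layout are illustrative choices, not part of the experiment code.

```python
import re

# Matches "mse:0.48", "mae:0.47", "rse:0.66" in a line of test output.
METRIC_RE = re.compile(r"(mse|mae|rse):([0-9.eE+-]+)")


def parse_metrics(line: str) -> dict:
    """Return {'mse': ..., 'mae': ..., 'rse': ...} parsed from one output line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


line = "mse:0.4814022183418274, mae:0.4737595319747925, rse:0.6606630086898804"
metrics = parse_metrics(line)
row = f"| PatchTST/64 | ETTh1 | 336 | {metrics['mse']:.4f} | {metrics['mae']:.4f} |"
print(row)
```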


### Compare our results with paper results

In [None]:
from IPython.display import Image
Image(filename=r"./Images/ETT1.png")

#### **Experiment Results**:
Comparing my results with the paper's results, which are highlighted in the image above.

| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/42 | ETTh1 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/42 | ETTh1 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/42 | ETTh1 | 336 | 0.4337225556373596 | 0.434622198343277 |
| PatchTST/42 | ETTh1 | 720 | 0.5196058750152588 | 0.5248337984085083 |


| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/64 | ETTh1 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/64 | ETTh1 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/64 | ETTh1 | 336 | 0.4814022183418274 | 0.4737595319747925 |
| PatchTST/64 | ETTh1 | 720 | 0.470274418592453 | 0.4867483675479889 |

---
# Working on ETTh2 Dataset
---

## Trial 1: PatchTST/42, Dataset: ETTh2, Prediction Length: 96
### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [9]:
args.data_path = 'ETTh2.csv' # data file
args.data = 'ETTh2'  # data
args.pred_len = 96 # prediction sequence length
args.num_patch = 42  # The number of input patches

print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 96, Number of Patches: 42


### Training

In [10]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8209
val 2785
test 2785
	iters: 100, epoch: 1 | loss: 0.3296215
	speed: 0.2841s/iter; left time: 1426.3843s
	iters: 200, epoch: 1 | loss: 0.4071481
	speed: 0.2830s/iter; left time: 1392.6717s
Epoch: 1 cost time: 72.61269783973694
Epoch: 1, Steps: 256 | Train Loss: 0.6070960 Vali Loss: 0.3094437 Test Loss: 0.4443583
Validation loss decreased (inf --> 0.309444).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4269828
	speed: 0.9280s/iter; left time: 4422.0935s
	iters: 200, epoch: 2 | loss: 0.2938041
	speed: 0.2840s/iter; left time: 1325.0090s
Epoch: 2 cost time: 72.60372042655945
Epoch: 2, Steps: 256 | Train Loss: 0.4624461 Vali Loss: 0.2535830 Test Loss: 0.3750189
Validation loss decreased (0.309444 --> 0.253583).  Saving model ...
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3597878
	speed: 0.9290s/iter; left time: 4188.6407s
	iters: 200, epoch: 3 | loss: 0.2856484
	speed: 0.2851s/iter; left time: 1256.9543s
Ep

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [11]:
exp.test(setting)
torch.cuda.empty_cache()

test 2785
mse:0.2877921462059021, mae:0.35264715552330017, rse:0.432295024394989
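
The `train`/`val`/`test` counts and the `Steps` value in the training log can be checked by hand. The sketch below counts sliding windows as `rows - seq_len - pred_len + 1`. It rests on several assumptions that are inferred from the logged numbers rather than stated in this notebook: a look-back window of `seq_len = 336`, the common hourly ETT split of 12 months for training and 4 months each for validation and testing (30-day months), validation and test ranges that start `seq_len` rows early, and a training batch size of 32 with the last incomplete batch dropped.

```python
def window_count(num_rows: int, seq_len: int, pred_len: int) -> int:
    """Number of (input, target) windows that fit into num_rows time steps."""
    return num_rows - seq_len - pred_len + 1


SEQ_LEN = 336                        # assumed look-back window
BATCH_SIZE = 32                      # assumed training batch size
TRAIN_ROWS = 12 * 30 * 24            # assumed: 12 months of hourly data
EVAL_ROWS = 4 * 30 * 24 + SEQ_LEN    # assumed: 4 months plus seq_len rows of context

for pred_len in (96, 192, 336, 720):
    train = window_count(TRAIN_ROWS, SEQ_LEN, pred_len)
    val = window_count(EVAL_ROWS, SEQ_LEN, pred_len)
    steps = train // BATCH_SIZE      # incomplete final batch dropped
    print(f"pred_len={pred_len}: train {train}, val/test {val}, steps {steps}")
```

With these assumptions, `pred_len = 96` gives 8209 training windows, 2785 validation and test windows, and 256 steps per epoch, which matches the log above.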


---
## Trial 2: PatchTST/64, Dataset: ETTh2, Prediction Length: 96
### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.

In [20]:
args.pred_len = 96 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 96, Number of Patches: 64


### Training

In [21]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8209
val 2785
test 2785
	iters: 100, epoch: 1 | loss: 0.7571195
	speed: 0.2093s/iter; left time: 1051.0720s
	iters: 200, epoch: 1 | loss: 0.7486441
	speed: 0.1912s/iter; left time: 940.7164s
Epoch: 1 cost time: 50.73831534385681
Epoch: 1, Steps: 256 | Train Loss: 0.5849259 Vali Loss: 0.2624639 Test Loss: 0.3637852
Validation loss decreased (inf --> 0.262464).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.3266740
	speed: 0.7429s/iter; left time: 3539.7215s
	iters: 200, epoch: 2 | loss: 0.3405324
	speed: 0.2360s/iter; left time: 1100.7197s
Epoch: 2 cost time: 60.89007520675659
Epoch: 2, Steps: 256 | Train Loss: 0.4569192 Vali Loss: 0.2672715 Test Loss: 0.3213788
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.5169606
	speed: 0.7460s/iter; left time: 3363.7368s
	iters: 200, epoch: 3 | loss: 0.3054450
	speed: 0.2370s/iter; left time: 1045.0616s
Epoch: 3 cost time: 61.69784593582153

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [22]:
exp.test(setting)
torch.cuda.empty_cache()

test 2785
mse:0.31302785873413086, mae:0.3729207217693329, rse:0.4508501887321472
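
The `Validation loss decreased` and `EarlyStopping counter: 1 out of 10` lines in the training log come from patience-based early stopping. The class below is a simplified sketch of that idea, not the implementation used by `Exp_Main`: it keeps the best validation loss, resets a counter on improvement, and signals a stop once the counter reaches the patience. The two losses fed to it are the validation losses of epochs 1 and 2 from this trial.

```python
class EarlyStopping:
    """Simplified patience-based early stopping (illustrative sketch)."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0
        self.should_stop = False

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True if the model should be checkpointed."""
        if val_loss < self.best:
            self.best = val_loss
            self.counter = 0
            return True
        self.counter += 1
        self.should_stop = self.counter >= self.patience
        return False


stopper = EarlyStopping(patience=10)
for epoch, val_loss in enumerate([0.2624639, 0.2672715], start=1):
    improved = stopper.step(val_loss)
    status = "saving model" if improved else f"counter {stopper.counter} out of {stopper.patience}"
    print(f"Epoch {epoch}: vali loss {val_loss:.7f}, {status}")
```

Because only the best checkpoint is kept, the final test metrics correspond to the epoch with the lowest validation loss rather than to the last epoch trained.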


---
## Trial 3: PatchTST/42, Dataset: ETTh2, Prediction Length: 192

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [23]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 192, Number of Patches: 42


### Training

In [24]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8113
val 2689
test 2689
	iters: 100, epoch: 1 | loss: 0.4758472
	speed: 0.2090s/iter; left time: 1036.7490s
	iters: 200, epoch: 1 | loss: 0.3944894
	speed: 0.2820s/iter; left time: 1370.8789s
Epoch: 1 cost time: 62.80650758743286
Epoch: 1, Steps: 253 | Train Loss: 0.6672120 Vali Loss: 0.3772666 Test Loss: 0.5099685
Validation loss decreased (inf --> 0.377267).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.7683532
	speed: 1.1720s/iter; left time: 5517.8984s
	iters: 200, epoch: 2 | loss: 0.4299730
	speed: 0.2690s/iter; left time: 1239.7452s
Epoch: 2 cost time: 69.69929361343384
Epoch: 2, Steps: 253 | Train Loss: 0.5342177 Vali Loss: 0.3972457 Test Loss: 0.7384973
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.6663211
	speed: 1.1249s/iter; left time: 5011.5550s
	iters: 200, epoch: 3 | loss: 0.2616723
	speed: 0.2300s/iter; left time: 1001.7356s
Epoch: 3 cost time: 63.9035964012146

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [25]:
exp.test(setting)
torch.cuda.empty_cache()

test 2689
mse:0.39005810022354126, mae:0.41103923320770264, rse:0.5008021593093872
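
The `Updating learning rate to ...` lines in the training log above show the rate being halved after every epoch (0.005 after epoch 1, 0.0025 after epoch 2). The sketch below reproduces that sequence. The halving rule is inferred from the log, and `lr_after_epoch` is a hypothetical helper rather than the scheduler used by the experiment code.

```python
def lr_after_epoch(epoch: int, first_lr: float = 0.005) -> float:
    """Learning rate logged after the given 1-based epoch, assuming per-epoch halving."""
    return first_lr * 0.5 ** (epoch - 1)


for epoch in range(1, 6):
    print(f"Epoch {epoch}: updating learning rate to {lr_after_epoch(epoch)}")
```

With this decay the step size shrinks by a factor of about 1000 within ten epochs, so most of the useful training happens in the first few epochs.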


---
## Trial 4: PatchTST/64, Dataset: ETTh2, Prediction Length: 192

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [26]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 192, Number of Patches: 64


### Training

In [27]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 8113
val 2689
test 2689
	iters: 100, epoch: 1 | loss: 0.4370615
	speed: 0.2612s/iter; left time: 1295.8034s
	iters: 200, epoch: 1 | loss: 0.7420160
	speed: 0.2620s/iter; left time: 1273.3663s
Epoch: 1 cost time: 65.11931347846985
Epoch: 1, Steps: 253 | Train Loss: 0.6650037 Vali Loss: 0.3667887 Test Loss: 0.4895372
Validation loss decreased (inf --> 0.366789).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.3104156
	speed: 1.4250s/iter; left time: 6709.0957s
	iters: 200, epoch: 2 | loss: 0.5599427
	speed: 0.1880s/iter; left time: 866.3137s
Epoch: 2 cost time: 62.50291919708252
Epoch: 2, Steps: 253 | Train Loss: 0.5431473 Vali Loss: 0.4427473 Test Loss: 0.8358685
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3144198
	speed: 1.3159s/iter; left time: 5862.2936s
	iters: 200, epoch: 3 | loss: 0.2996835
	speed: 0.1971s/iter; left time: 858.5016s
Epoch: 3 cost time: 63.59732675552368


Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [28]:
exp.test(setting)
torch.cuda.empty_cache()

test 2689
mse:0.391582190990448, mae:0.43561968207359314, rse:0.5017796158790588
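
The `left time` values in the training log can be read as the measured speed multiplied by the number of iterations still to run. The sketch below checks this against the first two progress lines of this trial. It assumes 20 training epochs and 0-based epoch and iteration counters; both are inferred from the logged numbers, not stated in the notebook, and `remaining_iters` is a hypothetical helper.

```python
def remaining_iters(train_epochs: int, epoch_idx: int, steps_per_epoch: int, iter_idx: int) -> int:
    """Iterations left, with 0-based epoch and iteration indices."""
    return (train_epochs - epoch_idx) * steps_per_epoch - iter_idx


TRAIN_EPOCHS = 20   # assumed, inferred from the log
STEPS = 253         # "Steps: 253" in the log above

# (iteration index, logged speed in s/iter, logged left time in s)
logged = [(99, 0.2612, 1295.8034), (199, 0.2620, 1273.3663)]
for iter_idx, speed, left_time in logged:
    estimate = speed * remaining_iters(TRAIN_EPOCHS, 0, STEPS, iter_idx)
    print(f"iter {iter_idx + 1}: estimated {estimate:.1f}s, logged {left_time:.1f}s")
```

The small differences come from the speed being rounded to four decimals in the log. The much higher speed reported at the first progress line of each later epoch (for example 1.4250 s/iter) is consistent with the timer also covering validation and testing at the end of the previous epoch, although the notebook does not confirm this.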


---
## Trial 5: PatchTST/42, Dataset: ETTh2, Prediction Length: 336

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [29]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 336, Number of Patches: 42


### Training

In [30]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.6213180
	speed: 0.2780s/iter; left time: 1356.8023s
	iters: 200, epoch: 1 | loss: 0.6660712
	speed: 0.2880s/iter; left time: 1376.8979s
Epoch: 1 cost time: 76.89646935462952
Epoch: 1, Steps: 249 | Train Loss: 0.7311775 Vali Loss: 0.4277453 Test Loss: 0.4471804
Validation loss decreased (inf --> 0.427745).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.6087614
	speed: 1.7781s/iter; left time: 8236.1108s
	iters: 200, epoch: 2 | loss: 0.4876281
	speed: 0.3339s/iter; left time: 1513.3912s
Epoch: 2 cost time: 87.2998058795929
Epoch: 2, Steps: 249 | Train Loss: 0.6134263 Vali Loss: 0.4421162 Test Loss: 0.4548156
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.3928181
	speed: 1.8880s/iter; left time: 8275.1433s
	iters: 200, epoch: 3 | loss: 0.5879908
	speed: 0.3840s/iter; left time: 1644.5920s
Epoch: 3 cost time: 92.79592943191528

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [31]:
exp.test(setting)
torch.cuda.empty_cache()

test 2545
mse:0.4471801817417145, mae:0.46865761280059814, rse:0.534436821937561


---
## Trial 6: PatchTST/64, Dataset: ETTh2, Prediction Length: 336

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [32]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 336, Number of Patches: 64


### Training

In [33]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7969
val 2545
test 2545
	iters: 100, epoch: 1 | loss: 0.8477201
	speed: 0.2082s/iter; left time: 1016.4142s
	iters: 200, epoch: 1 | loss: 0.7340878
	speed: 0.2189s/iter; left time: 1046.6976s
Epoch: 1 cost time: 53.93207311630249
Epoch: 1, Steps: 249 | Train Loss: 0.7225486 Vali Loss: 0.4625877 Test Loss: 0.5699974
Validation loss decreased (inf --> 0.462588).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4026355
	speed: 0.8470s/iter; left time: 3923.3568s
	iters: 200, epoch: 2 | loss: 0.5932161
	speed: 0.2441s/iter; left time: 1106.1355s
Epoch: 2 cost time: 58.89114546775818
Epoch: 2, Steps: 249 | Train Loss: 0.5968887 Vali Loss: 0.4708426 Test Loss: 0.6477016
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.5625603
	speed: 0.7970s/iter; left time: 3493.3154s
	iters: 200, epoch: 3 | loss: 0.4461977
	speed: 0.2389s/iter; left time: 1023.2267s
Epoch: 3 cost time: 57.3002939224243

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [34]:
exp.test(setting)
torch.cuda.empty_cache()

test 2545
mse:0.5775724649429321, mae:0.5237013101577759, rse:0.6073770523071289


---
## Trial 7: PatchTST/42, Dataset: ETTh2, Prediction Length: 720

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [37]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh2, Prediction Length: 720, Number of Patches: 42


### Training

In [38]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7585
val 2161
test 2161
	iters: 100, epoch: 1 | loss: 0.4212859
	speed: 0.2342s/iter; left time: 1087.0385s
	iters: 200, epoch: 1 | loss: 0.7312227
	speed: 0.2759s/iter; left time: 1253.0478s
Epoch: 1 cost time: 59.01574182510376
Epoch: 1, Steps: 237 | Train Loss: 0.8387095 Vali Loss: 0.6917935 Test Loss: 0.7220803
Validation loss decreased (inf --> 0.691794).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.4868017
	speed: 1.2760s/iter; left time: 5619.3527s
	iters: 200, epoch: 2 | loss: 0.7825486
	speed: 0.2471s/iter; left time: 1063.4961s
Epoch: 2 cost time: 61.913822650909424
Epoch: 2, Steps: 237 | Train Loss: 0.7230844 Vali Loss: 0.8397724 Test Loss: 0.7207495
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.5912184
	speed: 1.3189s/iter; left time: 5495.8989s
	iters: 200, epoch: 3 | loss: 0.6378720
	speed: 0.2770s/iter; left time: 1126.4691s
Epoch: 3 cost time: 67.501143932342

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [39]:
exp.test(setting)
torch.cuda.empty_cache()

test 2161
mse:0.7220802307128906, mae:0.6122045516967773, rse:0.6793708205223083


---
## Trial 8: PatchTST/64, Dataset: ETTh2, Prediction Length: 720

### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [41]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTh1, Prediction Length: 720, Number of Patches: 64


### Training

In [42]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 7585
val 2161
test 2161
	iters: 100, epoch: 1 | loss: 0.6905423
	speed: 0.2913s/iter; left time: 1351.8219s
	iters: 200, epoch: 1 | loss: 0.5533327
	speed: 0.2929s/iter; left time: 1330.0202s
Epoch: 1 cost time: 68.22403383255005
Epoch: 1, Steps: 237 | Train Loss: 0.6730704 Vali Loss: 1.2304925 Test Loss: 0.5032592
Validation loss decreased (inf --> 1.230492).  Saving model ...
Updating learning rate to 0.005
	iters: 100, epoch: 2 | loss: 0.5409788
	speed: 1.3981s/iter; left time: 6157.3664s
	iters: 200, epoch: 2 | loss: 0.5539945
	speed: 0.2920s/iter; left time: 1256.7084s
Epoch: 2 cost time: 71.20020580291748
Epoch: 2, Steps: 237 | Train Loss: 0.5450662 Vali Loss: 1.4113623 Test Loss: 0.5585139
EarlyStopping counter: 1 out of 10
Updating learning rate to 0.0025
	iters: 100, epoch: 3 | loss: 0.4570258
	speed: 1.4370s/iter; left time: 5987.9974s
	iters: 200, epoch: 3 | loss: 0.5009385
	speed: 0.2660s/iter; left time: 1081.8061s
Epoch: 3 cost time: 64.5997910499572

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=16, bias=True)
      (dropout): Dropout(p=0.3, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=16, out_features=16, bias=True)
              (W_K): Linear(in_features=16, out_features=16, bias=True)
              (W_V): Linear(in_features=16, out_features=16, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=16, out_features=16, bias=True)
                (1): Dropout(p=0.3, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.3, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
              (1)

### Testing

In [43]:
exp.test(setting)
torch.cuda.empty_cache()

test 2161
mse:0.5032591223716736, mae:0.5134444236755371, rse:0.6794742345809937


### Compare our results with paper results

In [None]:
from IPython.display import Image
Image(filename=r"./Images/ETT1.png")

#### **Experiment Results**:
Comparing my results with the paper's results, which are highlighted in the image above.

| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/42 | ETTh2 | 96 | 0.2877921462059021 | 0.35264715552330017 |
| PatchTST/42 | ETTh2 | 192 | 0.39005810022354126 | 0.41103923320770264 |
| PatchTST/42 | ETTh2 | 336 | 0.4471801817417145 | 0.46865761280059814 |
| PatchTST/42 | ETTh2 | 720 | 0.7220802307128906 | 0.6122045516967773 |


| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/64 | ETTh2 | 96 | 0.31302785873413086 | 0.3729207217693329 |
| PatchTST/64 | ETTh2 | 192 | 0.391582190990448 | 0.43561968207359314 |
| PatchTST/64 | ETTh2 | 336 | 0.5775724649429321 | 0.5237013101577759 |
| PatchTST/64 | ETTh2 | 720 | 0.5032591223716736 | 0.5134444236755371 |

---
# Working on ETTm1 Dataset
---

## Trial 1: PatchTST/42, Dataset: ETTm1, Prediction Length: 96
### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.


In [75]:
args.data_path = 'ETTm1.csv' # data file
args.data = 'ETTm1'  # data
args.pred_len = 96 # prediction sequence length
args.n_heads = 16 
args.d_model = 128 
args.d_ff = 256 
args.dropout = 0.2
args.fc_dropout = 0.2
args.patience = 10  # 20
args.pct_start = 0.4
args.patch_len = 16  # 60
args.num_patch = 42  # The number of input patches
args.lradj = 'TST'

print(f"Dataset: {args.data}, Prediction Length : {args.pred_len}") 
# print(args)

Dataset: ETTm1, Prediction Length : 96
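
The ETTm1 configuration uses a wider model (`d_model = 128`) with 16 attention heads. In standard multi-head attention the model width is split evenly across heads, so each head works with `d_model / n_heads` dimensions. The check below is a small sketch of that relationship; `head_dim` is a hypothetical helper, and it assumes the model follows the standard even split.

```python
def head_dim(d_model: int, n_heads: int) -> int:
    """Per-head dimension in standard multi-head attention."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    return d_model // n_heads


# Values set in the cell above.
d_model, n_heads, d_ff = 128, 16, 256
print(f"Each of the {n_heads} heads attends over {head_dim(d_model, n_heads)} dimensions")
print(f"Feed-forward expansion factor: {d_ff / d_model}")
```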


### Training

In [45]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 34129
val 11425
test 11425
	iters: 100, epoch: 1 | loss: 0.3383717
	speed: 0.1908s/iter; left time: 4049.1192s
	iters: 200, epoch: 1 | loss: 0.3019485
	speed: 0.1881s/iter; left time: 3973.4323s
	iters: 300, epoch: 1 | loss: 0.4252478
	speed: 0.1949s/iter; left time: 4096.2786s
	iters: 400, epoch: 1 | loss: 0.2874214
	speed: 0.1759s/iter; left time: 3680.8659s
	iters: 500, epoch: 1 | loss: 0.3519585
	speed: 0.1970s/iter; left time: 4102.5502s
	iters: 600, epoch: 1 | loss: 0.2871589
	speed: 0.2020s/iter; left time: 4186.2210s
	iters: 700, epoch: 1 | loss: 0.3036229
	speed: 0.2010s/iter; left time: 4144.4639s
	iters: 800, epoch: 1 | loss: 0.2620423
	speed: 0.1839s/iter; left time: 3774.6934s
	iters: 900, epoch: 1 | loss: 0.3427806
	speed: 0.1971s/iter; left time: 4025.0566s
	iters: 1000, epoch: 1 | loss: 0.3516700
	speed: 0.1899s/iter; left time: 3859.4710s
Epoch: 1 cost time: 205.17846369743347
Epoch: 1, Steps: 1066 | Train Loss: 0.3263496 Vali Loss: 0.4278865 Test

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=60, out_features=128, bias=True)
      (dropout): Dropout(p=0.2, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=128, out_features=128, bias=True)
              (W_K): Linear(in_features=128, out_features=128, bias=True)
              (W_V): Linear(in_features=128, out_features=128, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=128, out_features=128, bias=True)
                (1): Dropout(p=0.2, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.2, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
        

### Testing

In [46]:
exp.test(setting)
torch.cuda.empty_cache()

test 11425
mse:0.34566619992256165, mae:0.4012136459350586, rse:0.5594445466995239


---
## Trial 2: PatchTST/64, Dataset: ETTm1, Prediction Length: 96
### Set hyperparameters
Set the parameters (`args`) for this experiment; `args` is a dictionary-like object.

In [47]:
args.pred_len = 96 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTm1, Prediction Length: 96, Number of Patches: 64


### Training

In [48]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 34129
val 11425
test 11425
	iters: 100, epoch: 1 | loss: 0.3284838
	speed: 0.2034s/iter; left time: 4316.0460s
	iters: 200, epoch: 1 | loss: 0.3892616
	speed: 0.2299s/iter; left time: 4855.1031s
	iters: 300, epoch: 1 | loss: 0.3343551
	speed: 0.2181s/iter; left time: 4585.2195s
	iters: 400, epoch: 1 | loss: 0.3546901
	speed: 0.2370s/iter; left time: 4957.5456s
	iters: 500, epoch: 1 | loss: 0.2436483
	speed: 0.2189s/iter; left time: 4557.9345s
	iters: 600, epoch: 1 | loss: 0.2919290
	speed: 0.2121s/iter; left time: 4395.6333s
	iters: 700, epoch: 1 | loss: 0.2979367
	speed: 0.2149s/iter; left time: 4431.9062s
	iters: 800, epoch: 1 | loss: 0.2730506
	speed: 0.2341s/iter; left time: 4804.2584s
	iters: 900, epoch: 1 | loss: 0.3008355
	speed: 0.2319s/iter; left time: 4734.7312s
	iters: 1000, epoch: 1 | loss: 0.3204103
	speed: 0.2390s/iter; left time: 4857.3499s
Epoch: 1 cost time: 239.6319296360016
Epoch: 1, Steps: 1066 | Train Loss: 0.3245013 Vali Loss: 0.4370958 Test 

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=60, out_features=128, bias=True)
      (dropout): Dropout(p=0.2, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=128, out_features=128, bias=True)
              (W_K): Linear(in_features=128, out_features=128, bias=True)
              (W_V): Linear(in_features=128, out_features=128, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=128, out_features=128, bias=True)
                (1): Dropout(p=0.2, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.2, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
        

### Testing

In [49]:
exp.test(setting)
torch.cuda.empty_cache()

test 11425
mse:0.3271180987358093, mae:0.373868852853775, rse:0.5442279577255249


---
## Trial 3: PatchTST/42, Dataset: ETTm1, Prediction Length: 192

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [70]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTm1, Prediction Length: 192, Number of Patches: 42


### Training

In [71]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 34033
val 11329
test 11329
	iters: 100, epoch: 1 | loss: 0.3963512
	speed: 0.3154s/iter; left time: 6673.4125s
	iters: 200, epoch: 1 | loss: 0.3961364
	speed: 0.3210s/iter; left time: 6760.0896s
	iters: 300, epoch: 1 | loss: 0.3892810
	speed: 0.3260s/iter; left time: 6833.6930s
	iters: 400, epoch: 1 | loss: 0.3056287
	speed: 0.3400s/iter; left time: 7092.5647s
	iters: 500, epoch: 1 | loss: 0.4561167
	speed: 0.3150s/iter; left time: 6539.7569s
	iters: 600, epoch: 1 | loss: 0.4037109
	speed: 0.3370s/iter; left time: 6963.0674s
	iters: 700, epoch: 1 | loss: 0.2903167
	speed: 0.3200s/iter; left time: 6578.7339s
	iters: 800, epoch: 1 | loss: 0.3212291
	speed: 0.3251s/iter; left time: 6651.5133s
	iters: 900, epoch: 1 | loss: 0.3361028
	speed: 0.2879s/iter; left time: 5862.6059s
	iters: 1000, epoch: 1 | loss: 0.3125722
	speed: 0.3060s/iter; left time: 6200.1842s
Epoch: 1 cost time: 341.04079818725586
Epoch: 1, Steps: 1063 | Train Loss: 0.3533901 Vali Loss: 0.5473794 Test

Model(
  (model): PatchTST_backbone(
    (backbone): TSTiEncoder(
      (W_P): Linear(in_features=24, out_features=128, bias=True)
      (dropout): Dropout(p=0.2, inplace=False)
      (encoder): TSTEncoder(
        (layers): ModuleList(
          (0-2): 3 x TSTEncoderLayer(
            (self_attn): _MultiheadAttention(
              (W_Q): Linear(in_features=128, out_features=128, bias=True)
              (W_K): Linear(in_features=128, out_features=128, bias=True)
              (W_V): Linear(in_features=128, out_features=128, bias=True)
              (sdp_attn): _ScaledDotProductAttention(
                (attn_dropout): Dropout(p=0.0, inplace=False)
              )
              (to_out): Sequential(
                (0): Linear(in_features=128, out_features=128, bias=True)
                (1): Dropout(p=0.2, inplace=False)
              )
            )
            (dropout_attn): Dropout(p=0.2, inplace=False)
            (norm_attn): Sequential(
              (0): Transpose()
        

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

test 11329
mse:0.4130229949951172, mae:0.438975065946579, rse:0.6117640733718872


---
## Trial 4: PatchTST/64, Dataset: ETTm1, Prediction Length: 192

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [76]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

Dataset: ETTm1, Prediction Length: 192, Number of Patches: 64


### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

Use GPU: cuda:0
train 34033
val 11329
test 11329
	iters: 100, epoch: 1 | loss: 0.4179065
	speed: 0.3510s/iter; left time: 7427.1056s
	iters: 200, epoch: 1 | loss: 0.4193902
	speed: 0.3421s/iter; left time: 7204.8105s
	iters: 300, epoch: 1 | loss: 0.3345259
	speed: 0.3180s/iter; left time: 6665.0604s
	iters: 400, epoch: 1 | loss: 0.4020755
	speed: 0.3139s/iter; left time: 6548.3559s
	iters: 500, epoch: 1 | loss: 0.3505193
	speed: 0.3491s/iter; left time: 7247.9694s
	iters: 600, epoch: 1 | loss: 0.2714881
	speed: 0.3179s/iter; left time: 6568.8266s
	iters: 700, epoch: 1 | loss: 0.3446555
	speed: 0.3112s/iter; left time: 6397.9991s
	iters: 800, epoch: 1 | loss: 0.3572136
	speed: 0.3208s/iter; left time: 6563.3261s
	iters: 900, epoch: 1 | loss: 0.3371011
	speed: 0.3340s/iter; left time: 6801.1653s
	iters: 1000, epoch: 1 | loss: 0.3129145
	speed: 0.3190s/iter; left time: 6463.6882s
Epoch: 1 cost time: 349.4956429004669
Epoch: 1, Steps: 1063 | Train Loss: 0.3566981 Vali Loss: 0.5289810 Test 

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 5: PatchTST/42, Dataset: ETTm1, Prediction Length: 336

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 6: PatchTST/64, Dataset: ETTm1, Prediction Length: 336

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 7: PatchTST/42, Dataset: ETTm1, Prediction Length: 720

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 8: PatchTST/64, Dataset: ETTm1, Prediction Length: 720

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()
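
The eight trials above repeat the same three cells with different `pred_len` and `num_patch` values, so they could also be driven by a loop. The following is a minimal sketch of that idea, not the notebook's actual code: `run_trials`, `dry_run`, and `demo_args` are hypothetical names, and the stand-in callback only records the settings so that the block runs without a GPU or the dataset.

```python
from itertools import product
from types import SimpleNamespace

def run_trials(args, pred_lens, num_patches, run_fn):
    """Call run_fn(args, setting) once per (pred_len, num_patch) combination."""
    results = {}
    for pred_len, num_patch in product(pred_lens, num_patches):
        args.pred_len = pred_len
        args.num_patch = num_patch
        # Same naming scheme as the `setting` string used in the cells above.
        setting = f"PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}"
        results[setting] = run_fn(args, setting)
    return results

def dry_run(args, setting):
    # Stand-in for the real work. In the notebook this would be roughly:
    #   exp = Exp_Main(args); exp.train(setting); exp.test(setting)
    #   torch.cuda.empty_cache()
    return (args.pred_len, args.num_patch)

demo_args = SimpleNamespace(data="ETTm1")
results = run_trials(demo_args, [96, 192, 336, 720], [42, 64], dry_run)
for setting in results:
    print(setting)
```

Keeping the training call behind a callback makes it easy to list the planned settings first and swap in the real experiment afterwards.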

### Compare our results with paper results

In [None]:
from IPython.display import Image
Image(filename=r"./Images/ETT1.png")

#### **Experiment Results**:
Comparing my results with the paper's results highlighted in the image above.

| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/42 | ETTm1 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/42 | ETTm1 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/42 | ETTm1 | 336 | 0.4337225556373596 | 0.434622198343277 |
| PatchTST/42 | ETTm1 | 720 | 0.470274418592453 | 0.4867483675479889 |


| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/64 | ETTm1 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/64 | ETTm1 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/64 | ETTm1 | 336 | 0.4337225556373596 | 0.434622198343277 |
| PatchTST/64 | ETTm1 | 720 | 0.470274418592453 | 0.4867483675479889 |

---
# Working on ETTm2 Dataset
---

## Trial 1: PatchTST/42, Dataset: ETTm2, Prediction Length: 96
### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.data_path = 'ETTm2.csv' # data file
args.data = 'ETTm2'  # data
args.pred_len = 96 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 2: PatchTST/64, Dataset: ETTm2, Prediction Length: 96
### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.

In [None]:
args.pred_len = 96 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 3: PatchTST/42, Dataset: ETTm2, Prediction Length: 192

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 4: PatchTST/64, Dataset: ETTm2, Prediction Length: 192

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 192 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 5: PatchTST/42, Dataset: ETTm2, Prediction Length: 336

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 6: PatchTST/64, Dataset: ETTm2, Prediction Length: 336

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 336 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 7: PatchTST/42, Dataset: ETTm2, Prediction Length: 720

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 42  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

---
## Trial 8: PatchTST/64, Dataset: ETTm2, Prediction Length: 720

### Set hyperparameters
Set some parameters (`args`) for our experiment, dictionary-style.


In [None]:
args.pred_len = 720 # prediction sequence length
args.num_patch = 64  # The number of input patches
print(f"Dataset: {args.data}, Prediction Length: {args.pred_len}, Number of Patches: {args.num_patch}") 
# print(args)

### Training

In [None]:
Exp = Exp_Main
setting=f'PatchTST_train_on_{args.data}_{args.pred_len}_{args.num_patch}'
# set experiments
exp = Exp(args)
exp.train(setting)

### Testing

In [None]:
exp.test(setting)
torch.cuda.empty_cache()

### Compare our results with paper results

In [None]:
from IPython.display import Image
Image(filename=r"./Images/ETT1.png")

#### **Experiment Results**:
Comparing my results with the paper's results highlighted in the image above.

| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/42 | ETTm2 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/42 | ETTm2 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/42 | ETTm2 | 336 | 0.4337225556373596 | 0.434622198343277 |
| PatchTST/42 | ETTm2 | 720 | 0.470274418592453 | 0.4867483675479889 |


| Model | Dataset | Pred_len | MSE | MAE |
|---|---|---|---|---|
| PatchTST/64 | ETTm2 | 96 | 0.37019482254981995 | 0.3915349543094635 |
| PatchTST/64 | ETTm2 | 192 | 0.40499380230903625 | 0.4138428270816803 |
| PatchTST/64 | ETTm2 | 336 | 0.4337225556373596 | 0.434622198343277 |
| PatchTST/64 | ETTm2 | 720 | 0.470274418592453 | 0.4867483675479889 |

---
### Conclusion
---

Training is progressing well, and the `PatchTST` model is being optimized effectively.
- The training loss decreases steadily over the epochs, which indicates that the model is learning.
- The validation loss also decreases at each epoch, which suggests that the model generalizes well.

**The key positive signs I see are:**

- Decreasing loss and validation loss
- Learning rate decay schedule
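
One way to check these signs against the training output is to extract the per-epoch losses from the log text. The sketch below assumes the summary-line format printed by `exp.train` in this notebook (`Epoch: N, Steps: S | Train Loss: x Vali Loss: y ...`). The helpers `parse_epoch_lines` and `is_strictly_decreasing` are hypothetical, and the sample text is the single epoch summary shown for ETTm1 Trial 2, so it demonstrates only the parsing step.

```python
import re

# Matches the per-epoch summary line; the "cost time" line has no comma
# after the epoch number, so it is skipped.
EPOCH_RE = re.compile(
    r"Epoch:\s*(\d+),\s*Steps:\s*(\d+)\s*\|\s*"
    r"Train Loss:\s*([\d.]+)\s*Vali Loss:\s*([\d.]+)"
)

def parse_epoch_lines(log_text: str) -> list:
    """Return one dict per epoch summary line found in log_text."""
    rows = []
    for m in EPOCH_RE.finditer(log_text):
        rows.append({
            "epoch": int(m.group(1)),
            "steps": int(m.group(2)),
            "train_loss": float(m.group(3)),
            "vali_loss": float(m.group(4)),
        })
    return rows

def is_strictly_decreasing(values) -> bool:
    """True if every value is smaller than the one before it."""
    return all(b < a for a, b in zip(values, values[1:]))

sample = (
    "Epoch: 1 cost time: 239.6319296360016\n"
    "Epoch: 1, Steps: 1066 | Train Loss: 0.3245013 Vali Loss: 0.4370958 Test"
)
rows = parse_epoch_lines(sample)
print(rows)
```

With a full multi-epoch log, passing the `train_loss` and `vali_loss` columns to `is_strictly_decreasing` would confirm or refute the two bullet points above.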


---