# Transformer_having_features_for_TimeSeries_Forecasting
* This code was generated by ChatGPT.

For time series forecasting, especially when predicting future electricity consumption based on multiple features, deep learning models can significantly benefit from proper feature vectorization. This process involves transforming your raw data into a format that the neural network can effectively learn from. Given your scenario with 10 features, here are several strategies to vectorize these features for deep learning models:

### 1. **Feature Scaling**

First and foremost, normalize or standardize your features. This is crucial for models like neural networks to converge quickly. You can use Min-Max scaling to normalize the data or Z-score normalization to standardize it.

- **Normalization (Min-Max Scaling)**: Scales the features to a fixed range, usually [0, 1].
- **Standardization (Z-score normalization)**: Scales the features so they have the properties of a standard normal distribution with a mean of 0 and a standard deviation of 1.
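
As a minimal sketch of both options, assuming a small hypothetical array `raw` whose rows are hourly records. The statistics are computed on the training rows only, so that nothing from later rows leaks into the scaling:

```python
import numpy as np

# Hypothetical data: 8 hourly records with 2 features on very different scales.
raw = np.array([[10.0, 1000.0], [12.0, 1100.0], [11.0, 900.0], [13.0, 1200.0],
                [14.0, 1300.0], [15.0, 1250.0], [16.0, 1400.0], [18.0, 1500.0]])
train, test = raw[:6], raw[6:]  # chronological split

# Min-Max scaling: statistics come from the training rows only.
col_min, col_max = train.min(axis=0), train.max(axis=0)
train_minmax = (train - col_min) / (col_max - col_min)
test_minmax = (test - col_min) / (col_max - col_min)  # may fall outside [0, 1]

# Z-score standardization, again fitted on the training rows only.
mean, std = train.mean(axis=0), train.std(axis=0)
train_z = (train - mean) / std
test_z = (test - mean) / std
```

scikit-learn's `MinMaxScaler` and `StandardScaler` follow the same pattern: call `fit` on the training data, then `transform` on both splits.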

### 2. **Sequence Windowing**

For time series data, it's important to structure your input data into sequences that the model can learn from. This is often done by creating "windows" of past observations to predict future values.

- **Fixed Windowing**: Create fixed-size input sequences (windows) of your 10 features. For instance, use the past 24 hours of data (assuming hourly sampling) to predict the next hour's electricity consumption.
- **Sliding Windows**: Similar to fixed windowing but the window slides by a certain step. For example, you might slide by one hour at a time, creating overlapping windows of data.
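
The two windowing styles differ only in the step size. A minimal sketch, where `make_windows` is a hypothetical helper and the data is a placeholder array of 48 hourly records:

```python
import numpy as np

def make_windows(data, window, step=1):
    """Return an array of shape (num_windows, window, num_features)."""
    starts = range(0, len(data) - window + 1, step)
    return np.stack([data[s:s + window] for s in starts])

# Hypothetical data: 48 hourly records with 10 features each.
data = np.arange(48 * 10, dtype=float).reshape(48, 10)

hourly = make_windows(data, window=24, step=1)   # overlapping windows
daily = make_windows(data, window=24, step=24)   # non-overlapping windows
print(hourly.shape, daily.shape)
```

A step of 1 yields many highly correlated samples, while a step equal to the window length yields fewer, independent ones.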

### 3. **Time Embeddings**

If your data includes explicit time stamps (e.g., hour of the day, day of the week), you can convert these into cyclical features using sine and cosine transformations. This helps the model capture time-based patterns like daily or weekly cycles.
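
A minimal sketch of the sine/cosine transformation for the hour of the day and the day of the week. The helper `circle_distance` is hypothetical and only illustrates why the encoding helps:

```python
import numpy as np

hours = np.arange(24)               # hour of the day, 0..23
weekdays = np.arange(7)             # day of the week, 0..6

hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
weekday_sin = np.sin(2 * np.pi * weekdays / 7)
weekday_cos = np.cos(2 * np.pi * weekdays / 7)

def circle_distance(i, j):
    """Euclidean distance between two hours in the (sin, cos) plane."""
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

# 23:00 and 00:00 are as close as 00:00 and 01:00, unlike the raw values 23 and 0.
print(circle_distance(23, 0), circle_distance(0, 1))
```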

### 4. **Feature Embeddings for Categorical Data**

If any of your 10 features are categorical (e.g., type of day: holiday/weekend/workday), consider using embeddings to convert these categories into continuous vectors. This can be more effective than one-hot encoding for models to capture the nuances of categorical data.
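
A minimal sketch using PyTorch's `nn.Embedding`. It assumes a hypothetical integer coding of the day type (0 = workday, 1 = weekend, 2 = holiday) and nine remaining numeric features; the embedding size of 4 is an arbitrary choice:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical encoding of day type: 0 = workday, 1 = weekend, 2 = holiday.
day_type = torch.tensor([[0, 0, 1, 2, 0]])           # [batch=1, seq_len=5]
numeric = torch.rand(1, 5, 9)                        # 9 remaining numeric features

embedding = nn.Embedding(num_embeddings=3, embedding_dim=4)
day_vec = embedding(day_type)                        # [1, 5, 4]

# Concatenate the learned vectors with the numeric features per time step.
model_input = torch.cat([numeric, day_vec], dim=-1)  # [1, 5, 13]
print(model_input.shape)
```

The embedding weights are trained together with the rest of the model, so the layer would normally live inside the model's `__init__`.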

### 5. **Lag Features**

Create features that are lagged versions of the existing features. For instance, the electricity consumption from the previous day (or the same hour the previous day) can be a powerful feature for predicting future consumption.
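
A minimal sketch with pandas `shift`, assuming a hypothetical hourly DataFrame with a `consumption` column:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering three days.
index = pd.date_range("2020-01-01", periods=72, freq="h")
df = pd.DataFrame({"consumption": np.arange(72, dtype=float)}, index=index)

df["lag_1h"] = df["consumption"].shift(1)     # previous hour
df["lag_24h"] = df["consumption"].shift(24)   # same hour on the previous day

# The first rows have no history, so drop them before training.
lagged = df.dropna()
print(lagged.head())
```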

### 6. **Rolling Window Statistics**

Generate statistical features based on rolling windows, such as the mean, median, variance, or sum of the past N hours/days. These features can capture trends and seasonality in the data.
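
A minimal sketch with pandas `rolling`, assuming a hypothetical consumption series and a 3-hour window. The series is shifted by one step first so that the value being predicted is not part of its own window:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly consumption values.
consumption = pd.Series(np.arange(10, dtype=float))

# shift(1) keeps the current hour out of its own window, which avoids
# leaking the value being predicted into the feature.
past = consumption.shift(1)
rolling_mean_3h = past.rolling(window=3).mean()
rolling_std_3h = past.rolling(window=3).std()
print(pd.DataFrame({"mean_3h": rolling_mean_3h, "std_3h": rolling_std_3h}))
```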

### 7. **Differencing**

For non-stationary time series data, differencing can help stabilize the mean of the time series by removing changes in the level of a time series, and thus eliminate (or reduce) trend and seasonality.
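
A minimal sketch of first-order and seasonal differencing on a hypothetical series, including how to invert the first-order difference (predictions made on differenced data have to be mapped back to the original level):

```python
import numpy as np

# Hypothetical series with an upward trend.
series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

diff = np.diff(series)                      # first-order differences
# Inverting requires the first observed value.
restored = np.concatenate([[series[0]], series[0] + np.cumsum(diff)])

# Seasonal differencing with a hypothetical period of 2 steps.
period = 2
seasonal_diff = series[period:] - series[:-period]
```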

### Incorporating into a Deep Learning Model

Once you've vectorized your features using the strategies above, you can feed them into various types of deep learning models suitable for time series forecasting, such as:

- **Recurrent Neural Networks (RNNs)**: Good for capturing temporal dependencies.
- **Long Short-Term Memory (LSTM) networks**: A special kind of RNNs, effective in learning long-term dependencies.
- **Gated Recurrent Units (GRUs)**: Similar to LSTMs, but simpler and often faster to train.
- **Convolutional Neural Networks (CNNs)**: Not just for image data; 1D CNNs can be effective for sequence data.
- **Transformer models**: Though originally designed for natural language processing, transformers can be adapted for time series forecasting by treating the time series data as a sequence.

Remember, the effectiveness of each vectorization technique and model architecture can vary based on the specifics of your dataset and the nature of the forecasting problem. It's often beneficial to experiment with different approaches and combinations thereof.

## **Tips: When do we need to normalize the target variable?**
The need to normalize a target variable in time series (or any other type of data) largely depends on its characteristics and the modeling approach you're using. Here are types of target variables that often require normalization:

1. **Continuous Variables with Large Range**: If your target variable is a continuous variable that spans a large range of values, normalization can help to ensure that the optimization algorithm works efficiently. This is especially true for deep learning models, where having targets on a similar scale can significantly impact the convergence rate and stability of the learning process.

2. **Skewed Variables**: For target variables that are highly skewed, normalization (or a log transformation, which compresses large values) can help make the distribution more symmetric, improving model performance by making it easier for the model to learn the underlying patterns.

3. **Variables with Different Units and Scales**: In the context of multivariate time series forecasting, where you might be predicting multiple targets, normalization ensures that all variables contribute equally to the error term. Without normalization, a variable with a large scale can dominate the gradient updates, potentially leading to suboptimal performance.

4. **High Magnitude Variables**: Variables with values that have a high magnitude can lead to numerical instability in deep learning models due to the way floating-point arithmetic is handled in computers. Normalizing these variables to a lower range can help prevent issues like overflow, underflow, or vanishing/exploding gradients.

### When You Might Not Need to Normalize:
- **Binary or Categorical Targets**: For classification tasks where the target variable is binary or categorical (after being one-hot encoded or otherwise transformed), normalization of the target variable itself is not typically necessary. The focus would instead be on the features.

- **Targets with Narrow Range**: If the target variable inherently falls within a narrow range and you're using a model that's less sensitive to the scale of the input (like decision trees or certain ensemble methods), normalization might not be necessary.

- **Count Data with Low Variance**: If you're dealing with count data that doesn't vary widely, normalization might not offer significant benefits. However, for highly skewed count data, transformations like log scaling can still be beneficial.

It’s important to consider the nature of your target variable and the requirements of your modeling approach when deciding on normalization. Also, the decision to normalize should be guided by experimentation and validation on your specific dataset, as the benefits can vary depending on the context and the peculiarities of the data at hand.
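
If the target is scaled, predictions have to be mapped back before errors are reported in the original unit. A minimal sketch with hypothetical consumption values; `pred_scaled` stands in for model output and is not produced by a real model:

```python
import numpy as np

# Hypothetical consumption targets in kWh, split chronologically.
y_train = np.array([120.0, 150.0, 90.0, 300.0, 210.0])
y_test = np.array([180.0, 260.0])

# Fit the scaling statistics on the training targets only.
y_mean, y_std = y_train.mean(), y_train.std()
y_train_scaled = (y_train - y_mean) / y_std

# Stand-in for model output on the scaled target (hypothetical values).
pred_scaled = (y_test - y_mean) / y_std + np.array([0.1, -0.1])

# Undo the scaling before reporting errors in the original unit.
pred = pred_scaled * y_std + y_mean
mae_kwh = np.abs(pred - y_test).mean()
```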

## Step 1: Data Preparation
First, you need to prepare your dataset. This includes loading your data, normalizing it, and creating input sequences and their corresponding labels.

### Generate Sample Data
This data will consist of 10 features, with each row representing an hourly record.

In [22]:
import numpy as np
import pandas as pd

def generate_sample_data(num_records=1000):
    # Generate random data for 10 features
    data = np.random.rand(num_records, 10)

    # Assume the last feature is related to electricity consumption
    # and use it to create a target variable
    # The actual consumption is some combination of the features plus noise
    consumption = data[:, -1] * 0.5 + np.random.normal(0, 0.02, size=num_records)

    return pd.DataFrame(data, columns=[f'feature{i}' for i in range(1, 11)]), consumption

features, consumption = generate_sample_data()

In [23]:
features

Unnamed: 0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,feature9,feature10
0,0.982371,0.159263,0.183412,0.470824,0.870993,0.562760,0.508709,0.086757,0.961986,0.347116
1,0.251807,0.411744,0.414182,0.458533,0.688538,0.867947,0.057313,0.490769,0.413978,0.010880
2,0.816489,0.104313,0.114748,0.048298,0.200422,0.176816,0.347866,0.659270,0.054294,0.860276
3,0.326579,0.606607,0.012005,0.365808,0.002051,0.165458,0.447850,0.535087,0.152929,0.745216
4,0.189004,0.115845,0.333425,0.551429,0.988902,0.418716,0.425996,0.059333,0.765820,0.181783
...,...,...,...,...,...,...,...,...,...,...
995,0.640417,0.585508,0.406552,0.329525,0.179808,0.060143,0.634498,0.779200,0.857763,0.724308
996,0.896951,0.982807,0.961668,0.592230,0.224200,0.106355,0.993925,0.308390,0.033971,0.391046
997,0.748100,0.890800,0.477564,0.294029,0.778404,0.503768,0.086030,0.887750,0.655807,0.036812
998,0.009057,0.087722,0.821955,0.481308,0.335301,0.249293,0.506858,0.083040,0.617180,0.380527


In [24]:
consumption[:20]

array([0.20434603, 0.01926492, 0.41995442, 0.38108821, 0.07795549,
       0.30348063, 0.25940669, 0.01015272, 0.38136406, 0.16268581,
       0.04311213, 0.19002552, 0.20952929, 0.28378454, 0.09681489,
       0.28113735, 0.0799279 , 0.30924109, 0.34146173, 0.19844026])

### Data Preprocessing
For sequence models such as LSTMs and Transformers, we need to format our data into sequences. We'll also split the data into training and testing sets.

In [25]:
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import math

In [26]:
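# Create sequence data
# With a window of 5 time steps, this produces a 3-D array: 10 entries along x (number of features),
# 5 along y (time steps), and 995 along z (len(data_normalized) - 5).
# For the LSTM, this prepares a 5-step input sequence for each target y.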

def create_sequences(features, targets, time_steps=1):
    Xs, ys = [], []
    for i in range(len(features) - time_steps):
        Xs.append(features[i:(i + time_steps)])
        ys.append(targets[i + time_steps])
    return np.array(Xs), np.array(ys)

# Normalize data
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)

# Create sequences
time_steps = 5
X, y = create_sequences(features_scaled, consumption, time_steps)

In [27]:
X.shape

(995, 5, 10)

In [28]:
y.shape

(995,)

In [29]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to Pytorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
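
`train_test_split` shuffles by default, so overlapping windows from neighbouring hours can end up on both sides of the split. A chronological split is a common alternative for time series. A minimal sketch, using random stand-in arrays with the same shapes as `X` and `y` above:

```python
import numpy as np

# Stand-ins with the same shapes as X and y above (values are placeholders).
X = np.random.rand(995, 5, 10)
y = np.random.rand(995)

split = int(len(X) * 0.8)          # first 80% of the windows for training
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(X_train.shape, X_test.shape)
```

Passing `shuffle=False` to `train_test_split` gives the same kind of ordered split.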

## Step2: Transformer Model Setup
Unlike traditional Transformers used in NLP, for time series forecasting, we focus more on **the encoder part**. We'll simplify the implementation to make it easier to understand.
<br><br>
First, define a **Positional Encoding layer** to add information about the position of each time step in the input sequence, which helps the model distinguish the order of data points.

In [30]:
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape [1, max_len, d_model] so the encoding broadcasts over the batch dimension
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]; index the encoding by time step (dim 1)
        return x + self.pe[:, :x.size(1), :]

Next, we define the Transformer model. We'll simplify the architecture to include a single Transformer Encoder layer.

In [31]:
class TransformerModel(nn.Module):
    def __init__(self, input_size=10, num_layers=1, nhead=2, d_model=128, dim_feedforward=512, dropout=0.1):
        super().__init__()
        self.pos_encoder = PositionalEncoding(d_model, max_len=5000)
        # batch_first=True because the DataLoader yields [batch_size, seq_len, features]
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                                        dim_feedforward=dim_feedforward, dropout=dropout,
                                                        batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.encoder = nn.Linear(input_size, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, 1)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()

    def forward(self, src):
        # src: [batch_size, seq_len, input_size]
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src)
        # Predict from the last time step only: [batch_size, 1],
        # which matches targets.unsqueeze(-1) in the training loop
        return self.decoder(output[:, -1, :])


# class TransformerModel(nn.Module):
#     def __init__(self, input_size=10, num_layers=1, nhead=2, d_model=512, dim_feedforward=2048, dropout=0.1):
#         super(TransformerModel, self).__init__()
#         self.model_type = 'Transformer'
#         self.pos_encoder = PositionalEncoding(d_model, max_len=5000)
#         self.encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
#                                                         dim_feedforward=dim_feedforward, dropout=dropout)
#         self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
#         self.encoder = nn.Linear(input_size, d_model)
#         self.d_model = d_model
#         self.decoder = nn.Linear(d_model, 1)
#         self.init_weights()

#     def generate_attention_mask(self, sz, device):
#         mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
#         mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
#         return mask.to(device)

#     def init_weights(self):
#         initrange = 0.1
#         self.encoder.weight.data.uniform_(-initrange, initrange)
#         self.decoder.weight.data.uniform_(-initrange, initrange)
#         self.decoder.bias.data.zero_()

#     def forward(self, src):
#         src = self.encoder(src) * math.sqrt(self.d_model)
#         src = self.pos_encoder(src)

#         # Generate attention mask dynamically based on the batch size and sequence length
#         batch_size, seq_len, _ = src.size()
#         src_mask = self.generate_attention_mask(seq_len, src.device)

#         output = self.transformer_encoder(src, src_mask)
#         output = self.decoder(output)
#         return output

## Step3: Initializing the Model
Initialize the Transformer model with appropriate parameters. Given that we're working with time series data and not text, you might need to adjust parameters like `d_model` and `nhead` depending on your dataset's characteristics.

In [32]:
# Adjust the parameters according to your dataset and model complexity
input_size = 10 # Number of features
d_model = 512 # Embedding dimension
nhead = 2 # Number of heads in the multi-head attention
num_layers = 1 # Number of Transformer encoder layers
dim_feedforward = 2048 # Dimension of the feedforward network in nn.TransformerEncoderLayer
dropout = 0.1 # Dropout rate

model = TransformerModel(input_size=input_size,
                         num_layers=num_layers,
                         nhead=nhead,
                         d_model=d_model,
                         dim_feedforward=dim_feedforward,
                         dropout=dropout)

# If a GPU is available, move the model to it; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)




## Step4: Training the Model
Define your training loop, including loss function and optimizer. Training a Transformer model follows the same PyTorch training loop pattern as other models.
<br><br>
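No source mask is created in this example: every window has the same length, so there is no padding to hide, and each input window contains only past observations, so no future values need masking. A mask becomes necessary with padded (variable-length) sequences or when the model predicts at every time step.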
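
For cases where a causal mask is needed, a minimal sketch of how one could be built. `causal_mask` is a hypothetical helper and is not used by the training loop in this notebook:

```python
import torch

def causal_mask(seq_len):
    """Boolean mask where True marks positions that must not be attended to."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

mask = causal_mask(5)
# Row i is the query at time step i; it may attend to steps 0..i only.
print(mask)
```

`nn.TransformerEncoder` accepts such a mask through the `mask` argument of its forward call, for example `self.transformer_encoder(src, mask=mask)`.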

In [33]:
epochs = 5

for epoch in range(epochs):
    model.train()  # Ensure the model is in training mode
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets.unsqueeze(-1))
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")



Epoch 1, Loss: 0.1722
Epoch 2, Loss: 0.1195
Epoch 3, Loss: 0.1595
Epoch 4, Loss: 0.0796
Epoch 5, Loss: 0.0730


## Step5: Evaluating the Model
After training, evaluate the model's performance on the test set: predict on the test set and compare the predictions against the true values using a suitable metric (e.g., MSE for regression tasks).

In [34]:
# Ensure the model is in evaluation mode
model.eval()

test_loss = 0.0
predictions = []
targets_list = []

# No gradient updates needed for testing
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)

        # Calculate loss
        loss = criterion(outputs, targets.unsqueeze(-1))
        test_loss += loss.item() * inputs.size(0)

        # Store predictions and targets to evaluate further metrics (if necessary)
        predictions.extend(outputs.view(-1).cpu().numpy())
        targets_list.extend(targets.view(-1).cpu().numpy())

# Calculate average loss over all test data
test_loss /= len(test_loader.dataset)

print(f"Test Loss: {test_loss:.4f}")

# Optionally, calculate additional metrics like MAE, RMSE, etc., using predictions and targets_list


Test Loss: 0.0213




In [None]:
### error code
# with torch.no_grad():
#     predictions = []
#     for inputs, _ in test_loader:
#         predictions.append(model(inputs).numpy())

# # flatten the list of predictions
# predictions = np.concatenate(predictions, axis=0)

# loss = criterion(torch.tensor(predictions), torch.tensor(y_test))
# print(f"Test Loss: {loss.item():.4f}")

### Add a validation step to the training loop

In [35]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import math

# Assuming the generate_sample_data and other functions remain unchanged

features, consumption = generate_sample_data()
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(features)
time_steps = 5
X, y = create_sequences(features_scaled, consumption, time_steps)

# Adjusted Split: First split into train+val and test, then split train+val into train and val
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42) # 0.25 * 0.8 = 0.2

X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Assuming TransformerModel class remains unchanged

model = TransformerModel(input_size=10, num_layers=1, nhead=2, d_model=128, dim_feedforward=512, dropout=0.1)
model.to(device)

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 5

for epoch in range(epochs):
    model.train()
    train_loss = 0.0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets.unsqueeze(-1))
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * inputs.size(0)
    train_loss /= len(train_loader.dataset)

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets.unsqueeze(-1))
            val_loss += loss.item() * inputs.size(0)
    val_loss /= len(val_loader.dataset)

    print(f"Epoch {epoch+1}, Training Loss: {train_loss:.4f}, Validation Loss: {val_loss:.4f}")




Epoch 1, Training Loss: 0.4233, Validation Loss: 0.2043
Epoch 2, Training Loss: 0.1078, Validation Loss: 0.0365
Epoch 3, Training Loss: 0.0417, Validation Loss: 0.0239
Epoch 4, Training Loss: 0.0347, Validation Loss: 0.0280
Epoch 5, Training Loss: 0.0309, Validation Loss: 0.0204


**Explanation:**
* **Dataset Splitting**: Adjusted to create a validation set from the original training data, ensuring you have separate training, validation, and test datasets.
* **Validation Loop**: After each training epoch, the model is evaluated on the validation set. The **model.eval()** call switches layers such as dropout and batch normalization to inference behavior (dropout is disabled and batch normalization uses its running statistics), and torch.no_grad() ensures that gradients are not computed, reducing memory usage and speeding up computation.
* **Reporting**: The average training and validation losses are reported after each epoch, allowing you to monitor the model's performance and overfitting.
<br><br>
Remember, this adjustment uses part of your original training data for validation. If you have a separate validation dataset, you can skip the additional splitting and directly use your data.

In [36]:
### test loop

# Ensure the model is in evaluation mode
model.eval()

test_loss = 0.0
predictions = []
targets_list = []

# No gradient updates needed for testing
with torch.no_grad():
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        outputs = model(inputs)

        # Calculate loss
        loss = criterion(outputs, targets.unsqueeze(-1))
        test_loss += loss.item() * inputs.size(0)

        # Store predictions and targets to evaluate further metrics (if necessary)
        predictions.extend(outputs.view(-1).cpu().numpy())
        targets_list.extend(targets.view(-1).cpu().numpy())

# Calculate average loss over all test data
test_loss /= len(test_loader.dataset)

print(f"Test Loss: {test_loss:.4f}")

# Optionally, calculate additional metrics like MAE, RMSE, etc., using predictions and targets_list


Test Loss: 0.0251
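
The additional metrics mentioned in the last comment can be computed directly from the collected lists. A minimal sketch, with short hypothetical lists standing in for `predictions` and `targets_list`:

```python
import numpy as np

# Hypothetical stand-ins for the predictions and targets_list collected above.
predictions = [0.20, 0.35, 0.10, 0.40]
targets_list = [0.25, 0.30, 0.20, 0.40]

pred = np.asarray(predictions, dtype=float)
true = np.asarray(targets_list, dtype=float)

mae = np.mean(np.abs(pred - true))            # mean absolute error
rmse = np.sqrt(np.mean((pred - true) ** 2))   # root mean squared error
print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough idea of how much a few large misses contribute.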


# Transformer by gpt4

1. **Generate Sample Data**: Create synthetic electricity consumption data with 10 features as specified.
2. **Preprocessing**: Prepare the data for training, including normalization.
3. **PyTorch Dataset and DataLoader**: Implement custom dataset and dataloader for batching.
4. **Model Definition**: Define a Transformer model suitable for time series forecasting.
5. **Training and Validation**: Set up the training loop and validate the model on test data.

## Step 1 & 2: Generate Sample Data and Preprocessing

In [15]:
import pandas as pd
import numpy as np

# import warnings
# warnings.filterwarnings("ignore")

np.random.seed(42)  # For reproducibility

# Generate a DataFrame with datetime information
num_hours = 365 * 24  # A year's worth of hourly data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='h')  # ISO dates avoid ambiguous day/month parsing
df = pd.DataFrame(date_rng, columns=['date'])
df['weekday'] = df['date'].dt.weekday
df['hour'] = df['date'].dt.hour
df['season'] = df['date'].dt.month % 12 // 3 + 1

# Generate synthetic features and target variable
for i in range(7):  # Additional 7 features
    df[f'feature_{i}'] = np.random.rand(len(df))
df['electricity_consumption'] = np.random.rand(len(df)) * 100  # Target variable



In [16]:
df.head()

Unnamed: 0,date,weekday,hour,season,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,electricity_consumption
0,2020-01-01 00:00:00,2,0,1,0.37454,0.671368,0.40998,0.421576,0.137686,0.120749,0.616654,1.923384
1,2020-01-01 01:00:00,2,1,1,0.950714,0.523158,0.838483,0.280547,0.260339,0.520433,0.003229,47.550482
2,2020-01-01 02:00:00,2,2,1,0.731994,0.898639,0.185176,0.895044,0.48954,0.095159,0.792586,26.352564
3,2020-01-01 03:00:00,2,3,1,0.598658,0.164393,0.554842,0.332239,0.061339,0.256357,0.243121,53.995885
4,2020-01-01 04:00:00,2,4,1,0.156019,0.804109,0.722233,0.578596,0.095686,0.451709,0.299217,17.865769


## Step 3: PyTorch Dataset and DataLoader

In [17]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import Dataset
import torch
import torch.nn as nn

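# Normalize features (every column except the timestamp and the target)
scaler = MinMaxScaler()
feature_cols = df.columns[1:-1]
df[feature_cols] = df[feature_cols].astype(float)  # integer calendar columns become float
df[feature_cols] = scaler.fit_transform(df[feature_cols])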

# Function to create sequences
def create_sequences(input_data, target_data, input_steps, forecast_steps):
    X, y = [], []
    for i in range(len(input_data) - input_steps - forecast_steps):
        X.append(input_data.iloc[i:(i+input_steps)].values)
        y.append(target_data.iloc[i+input_steps:i+input_steps+forecast_steps].values)
    return np.array(X), np.array(y)

encoder_length = 168  # 7 days of hourly records
forecast_length = 24  # Predicting the next 24 hours

# Creating sequences
X, y = create_sequences(df.iloc[:,1:-1], df[['electricity_consumption']], encoder_length, forecast_length)

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [18]:
class ElectricityDataset(Dataset):
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return torch.tensor(self.features[idx], dtype=torch.float32), torch.tensor(self.targets[idx], dtype=torch.float32)

## Step 4: Transformer Model Definition
Given the complexity, we'll focus on a simplified Transformer structure, emphasizing positional encoding for handling sequential data.

In [19]:
class TransformerModel(nn.Module):
    def __init__(self, input_dim, model_dim, num_heads, num_encoder_layers, output_dim):
        super(TransformerModel, self).__init__()
        self.input_projection = nn.Linear(input_dim, model_dim)
        self.positional_encoder = PositionalEncoding(model_dim)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=num_heads)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_encoder_layers)
        # Ensure the output dimension matches target dimension [batch_size, forecast_length, 1]
        self.fc_out = nn.Linear(model_dim, 1)  # Output dim is 1 per time step

    def forward(self, src):
        src = self.input_projection(src)  # Project input features to model_dim
        src = self.positional_encoder(src)
        encoded_src = self.transformer_encoder(src)
        output = self.fc_out(encoded_src)  # Apply linear transformation
        # Select the last forecast_length steps from output
        output = output[-forecast_length:, :, :]
        return output.permute(1, 0, 2)  # Adjust output to match [batch_size, forecast_length, features]


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Shape [max_len, 1, d_model]: the model receives [seq_len, batch, d_model]
        self.register_buffer('pe', pe.unsqueeze(1))

    def forward(self, x):
        # Index the encoding by time step (dim 0) and broadcast over the batch (dim 1)
        x = x + self.pe[:x.size(0)]
        return x


## Step 5: Training

In [20]:
from torch.optim import Adam
from sklearn.model_selection import train_test_split

# Split data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_dataset = ElectricityDataset(X_train, y_train)
val_dataset = ElectricityDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Model instantiation
input_dim = 10  # Number of features
model_dim = 512
num_heads = 8
num_encoder_layers = 3
num_decoder_layers = 3  # Not used by this encoder-only model
output_dim = 24  # Forecast horizon (next 24 hours); the model emits one value per step and ignores this argument

model = TransformerModel(input_dim, model_dim, num_heads, num_encoder_layers, output_dim)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = Adam(model.parameters(), lr=0.001)

# Training loop (a single pass over the training data)
model.train()  # Set model to training mode
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    data = data.permute(1, 0, 2)  # [seq_len, batch, features]
    output = model(data)
    # Here, ensure output and target shapes are aligned
    # (forward() already slices, so this is a no-op kept as a safeguard)
    output = output[:, -forecast_length:, :]  # Focus on the last 'forecast_length' outputs
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    # Count processed samples with the batch size; len(data) is seq_len after the permute
    print(f"Train Epoch: 1 [{batch_idx * train_loader.batch_size}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")






### about error at training loop


In [None]:
# error code
def forward(self, src):
    src = self.input_projection(src)  # Project input features to model_dim
    src = self.positional_encoder(src)
    encoded_src = self.transformer_encoder(src)
    output = self.fc_out(encoded_src)  # Apply linear transformation
    # Ensure output is reshaped or sliced correctly if necessary
    return output.permute(1, 0, 2)  # Adjust output to match [batch_size, seq_len, features]


# Training loop
model.train()  # Set model to training mode
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    data = data.permute(1, 0, 2)  # Transformer expects [seq_len, batch, features]
    output = model(data)
    loss = criterion(output, target)  # Directly compare output and target
    loss.backward()
    optimizer.step()
    print(f"Train Epoch: 1 [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}")


The error and warning you're encountering are due to a shape mismatch between the model's output and the target tensors during the loss computation. Specifically, your model's output has the shape **[batch_size, 168, 1]**, while your target tensor has the shape **[batch_size, 24, 1]**. This discrepancy leads to the RuntimeError because the dimensions do not align for a valid Mean Squared Error (MSE) computation.
<br><br>
The cause of this issue is that the Transformer model is designed to output a sequence with the same length as the input sequence (168 steps in your case), but you are trying to compare it against a target sequence of only 24 steps.

**Solution**
<br><br>
To solve this issue, you need to modify your model or the way you handle its output so that it matches the target's shape. One straightforward approach is to adjust the Transformer model's output processing to select or aggregate its output to match the target size of 24 steps.
<br><br>
Given the structure of your model, where the output dimensionality is **[batch_size, seq_len, features]**, and you want to predict the next 24 hours (forecast_length), you should modify the forward method of your model to correctly shape the output. Here's how you might adjust your model:
<br><br>
Adjusting the Model's Output
<br><br>
One way to adjust the model's output is to focus on the last **forecast_length** outputs for comparison with the target. However, since your model outputs one value per timestep across the encoder's entire sequence length (168 steps), you need to select a subset of these steps that aligns with your forecasting objective.
<br><br>
A simple approach is to keep only the last **forecast_length** timesteps of the model output, either inside the **forward** method or, as shown here, in the training loop before the loss is computed:

In [None]:
# After the fix
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    data = data.permute(1, 0, 2)  # [seq_len, batch, features]
    output = model(data)
    # Here, ensure output and target shapes are aligned
    output = output[:, -forecast_length:, :]  # Focus on the last 'forecast_length' outputs
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()


This adjustment ensures that during the loss computation, only the last 24 outputs (corresponding to the **forecast_length**) from the model are compared against the 24-hour targets, resolving the shape mismatch issue.
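
The shape bookkeeping can be checked without running the model. A minimal sketch with zero-filled stand-in arrays of the shapes discussed above; the batch size of 32 matches the DataLoader, and the arrays are placeholders rather than real model output:

```python
import numpy as np

batch_size, encoder_length, forecast_length = 32, 168, 24

output = np.zeros((batch_size, encoder_length, 1))   # one value per input step
target = np.zeros((batch_size, forecast_length, 1))  # next 24 hours

sliced = output[:, -forecast_length:, :]             # keep the last 24 steps
print(output.shape, sliced.shape, target.shape)
```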

## Step6: Validation loop

In [21]:
# Validation loop
model.eval()  # Set model to evaluation mode
val_loss = 0
with torch.no_grad():
    for data, target in val_loader:
        data = data.permute(1, 0, 2)  # Adjusting data dimensions for the model
        output = model(data)
        # Ensure output shape matches target shape [batch_size, forecast_length, 1]
        # No need for view or reshape if model output is correctly sized
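        val_loss += criterion(output, target).item()  # Sum of per-batch mean losses

# Note: dividing a sum of batch means by the number of samples yields roughly
# (per-sample MSE) / batch_size, not the per-sample MSE itself.
val_loss /= len(val_loader.dataset)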
print(f'\nValidation set: Average loss: {val_loss:.4f}\n')



Validation set: Average loss: 26.4183
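
`nn.MSELoss()` returns the mean over a batch by default, so summing those means and dividing by the number of samples gives a value roughly `batch_size` times smaller than the per-sample MSE. A sketch of a sample-weighted average, using hypothetical batch losses and batch sizes:

```python
# Hypothetical per-batch mean losses and the number of samples in each batch.
batch_losses = [800.0, 850.0, 900.0]
batch_sizes = [32, 32, 16]

# Weight each batch mean by its batch size, then divide by the sample count.
total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
weighted_mean = total / sum(batch_sizes)          # per-sample average

# For comparison: sum of batch means divided by the sample count.
unweighted = sum(batch_losses) / sum(batch_sizes)
print(weighted_mean, unweighted)
```

In the validation loop this corresponds to accumulating `criterion(output, target).item() * target.size(0)` for each batch, as the earlier test loop in this notebook does with `inputs.size(0)`.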



### error code at validation loop

This modification assumes that your model's output already has the correct shape **[batch_size, forecast_length, 1]**, as intended after our previous adjustments to the model. This way, you directly compare **output** and **target** without additional reshaping, ensuring that the shapes align for the loss calculation.
<br><br>
If your model's output does not inherently have the correct shape, you may need to revisit and ensure that the model's forward method or the post-processing of its output (just before the loss calculation) correctly aligns the output shape with the target tensor shape. Since we've ensured the model's output should be **[batch_size, forecast_length, 1]**, this direct comparison in the validation loop should now work without issues.

In [None]:
# Validation loop
model.eval()  # Set model to evaluation mode
val_loss = 0
with torch.no_grad():
    for data, target in val_loader:
        data = data.permute(1, 0, 2)  # Adjusting data dimensions for the model
        output = model(data)
        val_loss += criterion(output.view(-1, output_dim), target).item()  # Sum up batch loss

val_loss /= len(val_loader.dataset)
print(f'\nValidation set: Average loss: {val_loss:.4f}\n')