# Transformers for Time Series
* https://www.youtube.com/watch?v=NGzQpphf_Vc&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi&index=49
* https://github.com/jeffheaton/app_deep_learning/blob/main/t81_558_class_10_3_transformer_timeseries.ipynb

In [None]:
try:
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
import torch
has_mps = torch.backends.mps.is_built()
device = "mps" if has_mps else "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Note: using Google CoLab
Using device: cuda


The landscape of deep learning has witnessed monumental strides in the recent past, particularly in the domain of Natural Language Processing (NLP). Central to this revolution has been the advent of transformer architectures, which, with their attention mechanisms, have pushed the boundaries of what's achievable in tasks like machine translation, sentiment analysis, and language modeling. However, while transformers initially rose to prominence primarily within the realm of NLP, they can also serve as models for time-series prediction, a challenge that, though numerically distinct, bears a conceptual resemblance to understanding sequences in language.

In time-series prediction, the objective often centers around forecasting future values based on historical data. This could involve predicting stock prices, weather patterns, or even the consumption of electricity in a region. At its core, this is a sequence-to-sequence task, where the past values form an input sequence and the future values to be predicted form an output sequence. Now, consider the similarities with machine translation in NLP, where an input sequence (a sentence) in one language is translated into an output sequence in another language. Both scenarios require the model to recognize patterns, interdependencies, and context across sequences.

This chapter delves into the nuances of using PyTorch transformers for time-series prediction. We will embark on this journey by first establishing a foundational understanding of how transformers operate within the NLP space, before moving on to their adaptation for numeric sequences. By juxtaposing these two applications, readers will gain a comprehensive appreciation of the transformer's versatility and the subtle considerations required when transitioning from text to time.



## Load Sun Spot Data for a Transformer Time Series

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import StandardScaler
from torch.optim.lr_scheduler import ReduceLROnPlateau

names = ['year', 'month', 'day', 'dec_year', 'sn_value',
         'sn_error', 'obs_num', 'unused1']
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/SN_d_tot_V2.0.csv",
    sep=';', header=None, names=names,
    na_values=['-1'], index_col=False)

The data preprocessing is the same as introduced in the previous section. We use data from before the year 2000 for training and the rest for validation.

In [None]:
# Data Preprocessing
start_id = max(df[df['obs_num'] == 0].index.tolist()) + 1
df = df[start_id:].copy()
df['sn_value'] = df['sn_value'].astype(float)
df_train = df[df['year'] < 2000]
df_test = df[df['year'] >= 2000]

spots_train = df_train['sn_value'].to_numpy().reshape(-1, 1)
spots_test = df_test['sn_value'].to_numpy().reshape(-1, 1)

scaler = StandardScaler()
spots_train = scaler.fit_transform(spots_train).flatten().tolist()
spots_test = scaler.transform(spots_test).flatten().tolist()

Just as we did for the LSTM in the previous section, we again break the data into sequences.

In [None]:
# Sequence Data Preparation
SEQUENCE_SIZE = 10

def to_sequences(seq_size, obs):
    x = []
    y = []
    for i in range(len(obs) - seq_size):
        window = obs[i:(i + seq_size)]
        after_window = obs[i + seq_size]
        x.append(window)
        y.append(after_window)

    return torch.tensor(x, dtype=torch.float32).view(-1, seq_size, 1), torch.tensor(y, dtype=torch.float32).view(-1, 1)

x_train, y_train = to_sequences(SEQUENCE_SIZE, spots_train)
x_test, y_test = to_sequences(SEQUENCE_SIZE, spots_test)

# Setup data loaders for batch
train_dataset = TensorDataset(x_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataset = TensorDataset(x_test, y_test)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [None]:
for x, y in train_loader:
    print(f"Shape of x [batch, time, features]: {x.shape}")
    print(f"Shape of y [batch, features]: {y.shape}\n")
    break

Shape of x [batch, time, features]: torch.Size([32, 10, 1])
Shape of y [batch, features]: torch.Size([32, 1])
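
To make the windowing concrete, the following minimal sketch applies the same sliding-window logic to a short toy list in plain Python. The helper name `toy_windows` and the sample values are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical plain-Python mirror of the windowing logic in to_sequences.
def toy_windows(seq_size, obs):
    x, y = [], []
    for i in range(len(obs) - seq_size):
        x.append(obs[i:i + seq_size])   # the input window
        y.append(obs[i + seq_size])     # the value right after the window
    return x, y

toy_x, toy_y = toy_windows(3, [10, 20, 30, 40, 50, 60])
for window, target in zip(toy_x, toy_y):
    print(window, "->", target)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
# [30, 40, 50] -> 60
```

A series of length 6 with a window of 3 yields 3 training pairs, which matches the `len(obs) - seq_size` loop bound.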



## Positional Encoding for Transformers
In the transformer architecture, a pivotal component that ensures the model's success is its ability to consider the order of a sequence. Unlike traditional RNNs or LSTMs, which process sequences step by step and inherently respect their order, transformers process all tokens in a sequence simultaneously. While this parallel processing significantly boosts computational efficiency and allows long-range dependencies to be captured more effectively, it also means that transformers, in their native form, are oblivious to the position or order of tokens in a sequence. This is where the concept of positional encoding comes into play.

Positional encoding is a mechanism to provide the transformer with information about the position of tokens within a sequence. Essentially, it infuses order information into the otherwise position-agnostic embeddings. By adding positional encodings to the token embeddings before feeding them into the transformer, each token's position in the sequence becomes discernible to the model.

Positional encodings are vectors that get added to the embeddings of tokens. The intuition is to design these vectors in such a way that their values or patterns are unique for each position, allowing the model to differentiate between different positions in the sequence.

A popular method to generate positional encodings is using sinusoidal functions. For a given position $p$ in the sequence and embedding dimension $d$, the positional encoding is computed as:
$$ PE(p, 2i) = \sin\left(\frac{p}{10000^{2i/d}}\right) $$
$$ PE(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right) $$

Here $i$ indexes the sine/cosine pairs, so $2i$ and $2i+1$ are the even and odd dimension indices. These sinusoidal functions generate values between -1 and 1 and ensure a unique and repeatable pattern for each position.

The choice of sinusoidal functions isn't arbitrary. They have two compelling properties:
1. They produce values between -1 and 1, making them compatible with most embedding value ranges.
2. Their patterns allow the model to extrapolate positions beyond the sequence lengths seen during training.
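
As a worked illustration of the sinusoidal scheme, the sketch below evaluates the encoding for a few positions using only Python's `math` module. It assumes the standard form `PE(p, 2i) = sin(p / 10000^(2i/d))` and `PE(p, 2i+1) = cos(p / 10000^(2i/d))`; the function name `sinusoidal_encoding` and the tiny embedding size are illustrative choices.

```python
import math

def sinusoidal_encoding(p, d):
    """Encoding vector for position p with an even embedding size d."""
    vec = []
    for i in range(d // 2):
        angle = p / (10000 ** (2 * i / d))
        vec.append(math.sin(angle))  # dimension 2i
        vec.append(math.cos(angle))  # dimension 2i + 1
    return vec

d = 4
for p in range(3):
    print(p, [round(v, 4) for v in sinusoidal_encoding(p, d)])
```

Every entry stays within [-1, 1] regardless of how large `p` becomes, which is the first property listed above.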

One might wonder, why not just append or add the token's position as an integer to the embedding? The challenge with this approach is scale. Embedding values, especially after being trained, tend to lie within a specific range, and directly adding large integers (for tokens further down in long sequences) might disrupt the information in the original embeddings.

Furthermore, using raw integers wouldn't provide a consistent way for the model to generalize or extrapolate to sequence lengths not seen during training. Sinusoidal functions, in contrast, offer a predictable pattern that aids in such extrapolation.

The following code cells build a simple transformer-based model using PyTorch's built-in functionality, starting with the positional encoding module. The **TransformerModel** class, defined in the next section, encapsulates a transformer-based neural network designed for sequence processing. Upon initialization, the model sets up several components: an **encoder** to project the input data to the desired dimension, a **pos_encoder** to supply the sequence with positional information, a core **transformer_encoder** comprising several layers to process the sequence, and a **decoder** to produce the final output. As data flows through the model during the forward pass, it undergoes a series of transformations: it is first projected to a higher dimension, then augmented with positional encodings, processed by the transformer layers, and finally the last token's representation is used to produce the output. An instance of this model can then be created and assigned to a computation device for training or inference.

In [None]:
# Positional Encoding for Transformer
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

## Constructing the Transformer Model
The following code constructs the actual transformer-based model for time series prediction. The model is constructed to accept the following parameters.
* **input_dim**: The dimension of the input data. In this case we use only one input, the number of sunspots.
* **d_model**: The number of features in the transformer model's internal representations (also the size of embeddings). This controls how much a model can remember and process.
* **nhead**: The number of attention heads in the multi-head self-attention mechanism.
* **num_layers**: The number of transformer encoder layers.
* **dropout**: The dropout probability.

In [None]:
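# Model definition using Transformer
class TransformerModel(nn.Module):
    def __init__(self, input_dim=1, d_model=64, nhead=4, num_layers=2, dropout=0.2):
        super(TransformerModel, self).__init__()

        self.encoder = nn.Linear(input_dim, d_model)
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        # With the default batch_first=False, the encoder layers expect
        # input shaped [time, batch, d_model].
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        self.decoder = nn.Linear(d_model, 1)

    def forward(self, x):
        # x arrives from the data loader as [batch, time, features].
        x = self.encoder(x)                # [batch, time, d_model]
        x = x.transpose(0, 1)              # [time, batch, d_model]
        x = self.pos_encoder(x)            # add encodings along the time axis
        x = self.transformer_encoder(x)    # attend across time steps
        x = self.decoder(x[-1])            # last time step -> [batch, 1]
        return x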

model = TransformerModel()
model.to(device)

TransformerModel(
  (encoder): Linear(in_features=1, out_features=64, bias=True)
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=64, out_features=64, bias=True)
        )
        (linear1): Linear(in_features=64, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=64, bias=True)
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (decoder): Linear(in_features=64, out_features=1, bias=True)
)

The transformer architecture in PyTorch is governed by crucial configuration choices, among which **d_model**, **nhead**, and **num_layers** hold significant weight.

The **d_model** parameter denotes the dimensionality of the input embeddings and affects the model's capacity to learn intricate representations.
While a larger **d_model** can bolster the richness of the model's understanding, it also increases the computational demand and can pose overfitting risks if not carefully chosen. In parallel, the model's gradient flow and initialization are affected by this choice, though the Transformer's normalization layers often moderate potential issues.


On the other hand, **nhead** reflects the number of heads in the multi-head attention mechanism. A higher number of heads lets the model focus on diverse segments of the input simultaneously, enabling the capture of varied contextual nuances. However, there is a trade-off: beyond a certain threshold, the computational overhead might outweigh the marginal performance gains. The parallel processing provided by multiple attention heads tends to offer more stable and varied gradient information, positively influencing the training dynamics.


Lastly, the **num_layers** parameter dictates the depth of the Transformer, determining the number of stacked encoder or decoder layers. A deeper model can discern more complex and hierarchical relationships in data. Still, there is a caveat: after a certain depth, performance gains may plateau and the risk of overfitting may rise. Training deeper models also comes with its own set of challenges. Although residual connections and normalization in Transformers alleviate some concerns, a high layer count might necessitate techniques like gradient clipping or learning rate adjustments for stable training.
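
Gradient clipping, mentioned above, can be sketched with `torch.nn.utils.clip_grad_norm_`, which rescales the gradients in place between `loss.backward()` and `optimizer.step()`. The tiny linear model, the exaggerated targets, and the `max_norm=1.0` threshold below are illustrative assumptions and not settings used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 1)                 # stand-in for a deeper network
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)

toy_x = torch.randn(8, 4)
toy_y = torch.randn(8, 1) * 100.0           # large targets to provoke large gradients

toy_optimizer.zero_grad()
toy_loss = nn.functional.mse_loss(toy_model(toy_x), toy_y)
toy_loss.backward()

# Rescale all gradients so their combined L2 norm is at most max_norm.
# The call returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_model.parameters()))
toy_optimizer.step()

print(f"grad norm before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

In the training loop of this notebook, the equivalent call would sit on the line between `loss.backward()` and `optimizer.step()`.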


In essence, these parameters balance model capacity, computational efficiency, and generalization capability. Their optimal settings often emerge from task-specific experimentation, the nature of the data, and the available computational resources.
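
One rough way to see how these settings affect model size is to count trainable parameters for a few configurations. The sketch below uses a hypothetical helper, `count_encoder_params`, and arbitrary example configurations; it measures only the encoder stack, not the input and output linear layers. Note that `d_model` must be divisible by `nhead`, because the heads split the embedding between them.

```python
import torch.nn as nn

def count_encoder_params(d_model, nhead, num_layers):
    """Number of trainable parameters in a stack of encoder layers."""
    layer = nn.TransformerEncoderLayer(d_model, nhead)
    encoder = nn.TransformerEncoder(layer, num_layers)
    return sum(p.numel() for p in encoder.parameters() if p.requires_grad)

configs = [(64, 4, 2), (64, 8, 2), (128, 4, 2), (64, 4, 4)]
for d_model, nhead, num_layers in configs:
    n = count_encoder_params(d_model, nhead, num_layers)
    print(f"d_model={d_model:3d} nhead={nhead} num_layers={num_layers} -> {n:,} parameters")
```

Changing `nhead` leaves the count unchanged, since the heads partition the same projection matrices, while doubling `num_layers` doubles it. Parameter count is only a proxy for capacity and cost, so it complements rather than replaces experimentation.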



## Training the Model
Training a transformer-based model adheres to many of the familiar paradigms and best practices that apply to other neural network architectures. Much like the models we've encountered before, a transformer-based model benefits from training in batches, which helps with both computational efficiency and generalization. Batched training ensures that the model updates its weights based on the average gradient over several data points, rather than being excessively influenced by any single instance. Additionally, early stopping acts as a safeguard against overfitting. By monitoring the model's performance on a validation set and halting training when no significant improvement is observed over a set number of epochs, we encourage the model to generalize well rather than memorize the training data. The validation set remains an essential component of the training regimen, providing a proxy measure of the model's performance on unseen data and guiding hyperparameter tuning. Thus, while transformer architectures introduce novel mechanisms and complexities, the foundational principles of training deep learning models in PyTorch remain consistent.

In [None]:
# Train the model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# The verbose argument is omitted because recent PyTorch releases deprecate it.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

epochs = 1000
early_stop_count = 0
min_val_loss = float('inf')

for epoch in range(epochs):
    model.train()
    train_losses = []
    for batch in train_loader:
        x_batch, y_batch = batch
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        optimizer.zero_grad()
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())

    # Validation
    model.eval()
    val_losses = []
    with torch.no_grad():
        for batch in test_loader:
            x_batch, y_batch = batch
            x_batch = x_batch.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(x_batch)
            loss = criterion(y_pred, y_batch)
            val_losses.append(loss.item())

    # Average the per-batch losses so both numbers describe the whole epoch.
    train_loss = np.mean(train_losses)
    val_loss = np.mean(val_losses)
    scheduler.step(val_loss)

    if val_loss < min_val_loss:
        min_val_loss = val_loss
        early_stop_count = 0
    else:
        early_stop_count += 1

    if early_stop_count >= 5:
        print("Early stopping!")
        break

    print(f"Epoch {epoch+1}/{epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")



Epoch 1/1000, Train Loss: 0.3037, Val Loss: 0.3037
Epoch 2/1000, Train Loss: 0.0413, Val Loss: 0.0413
Epoch 3/1000, Train Loss: 0.1478, Val Loss: 0.1478
Epoch 4/1000, Train Loss: 0.0707, Val Loss: 0.0707
Epoch 5/1000, Train Loss: 0.0511, Val Loss: 0.0511
Epoch 6/1000, Train Loss: 0.0227, Val Loss: 0.0227
Epoch 7/1000, Train Loss: 0.0334, Val Loss: 0.0334
Epoch 8/1000, Train Loss: 0.1481, Val Loss: 0.1481
Epoch 9/1000, Train Loss: 0.0855, Val Loss: 0.0855
Epoch 10/1000, Train Loss: 0.0348, Val Loss: 0.0348
Epoch 11/1000, Train Loss: 0.0141, Val Loss: 0.0141
Epoch 12/1000, Train Loss: 0.0850, Val Loss: 0.0850
Epoch 13/1000, Train Loss: 0.0859, Val Loss: 0.0859
Epoch 14/1000, Train Loss: 0.0530, Val Loss: 0.0530
Epoch 15/1000, Train Loss: 0.1756, Val Loss: 0.1756
Early stopping!


We can now evaluate the performance of this model.

In [None]:
# Evaluation
model.eval()
predictions = []
with torch.no_grad():
    for batch in test_loader:
        x_batch, y_batch = batch
        x_batch = x_batch.to(device)
        y_pred = model(x_batch)
        # squeeze(-1) keeps the batch axis even when the last batch has one item
        predictions.extend(y_pred.squeeze(-1).cpu().tolist())

# Undo the scaling so the error is reported in the original sunspot units.
pred_spots = scaler.inverse_transform(np.array(predictions).reshape(-1, 1))
true_spots = scaler.inverse_transform(y_test.numpy().reshape(-1, 1))
rmse = np.sqrt(np.mean((pred_spots - true_spots) ** 2))
print(f"Score (RMSE): {rmse:.4f}")

Score (RMSE): 14.8968


# Positional Encoding in Detail
```python
# Positional Encoding for Transformer
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
```

## 1. Class Initialization
```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
```
* **PositionalEncoding(nn.Module)**: This defines a new class PositionalEncoding that inherits from nn.Module, which is the base class for all neural network modules in PyTorch.
* **__init__(self, d_model, dropout=0.1, max_len=5000)**: The constructor method initializes the class with three parameters: d_model (the dimension of the model), dropout (dropout rate), and max_len (the maximum length of the sequences).
* **super(PositionalEncoding, self).__init__()**: This line calls the constructor of the parent class nn.Module.
* **self.dropout = nn.Dropout(p=dropout)**: This line initializes a dropout layer with the given dropout rate.

## 2. Creating the Positional Encoding Matrix
```python
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
```
* **pe = torch.zeros(max_len, d_model)**: This creates a tensor **pe** of shape **(max_len, d_model)** filled with zeros. This tensor will hold the positional encoding.
* **position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)**: This creates a tensor of shape **(max_len, 1)** containing position indices from 0 to max_len-1.
* **div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))**: This creates a tensor of shape **(d_model/2,)** containing the exponential terms used to scale the positions.

```python
torch.arange(0, 5, dtype=torch.float)
# -> tensor([0., 1., 2., 3., 4.])

torch.arange(0, 5, dtype=torch.float).unsqueeze(1)
# -> tensor([[0.],
#        [1.],
#        [2.],
#        [3.],
#        [4.]])

torch.arange(0, 5, 2).float()
# -> tensor([0., 2., 4.])
```
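
The `div_term` expression can look unrelated to the `10000^(2i/d)` denominator in the sinusoidal formula, but the two are algebraically the same: `exp(2i * (-ln(10000) / d))` equals `1 / 10000^(2i/d)`. The following sketch checks this numerically with plain Python; the function names and the value `d_model = 8` are illustrative assumptions.

```python
import math

def div_term_exp(two_i, d_model):
    # Form used in the code: exp(2i * (-ln(10000) / d_model))
    return math.exp(two_i * (-math.log(10000.0) / d_model))

def div_term_direct(two_i, d_model):
    # Form used in the formula: 1 / 10000^(2i / d_model)
    return 1.0 / (10000.0 ** (two_i / d_model))

d_model = 8
for two_i in range(0, d_model, 2):   # same values as torch.arange(0, d_model, 2)
    print(two_i, div_term_exp(two_i, d_model), div_term_direct(two_i, d_model))
```

Multiplying a position by `div_term` is therefore the same as dividing it by `10000^(2i/d)`.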

## 3. Applying Sine and Cosine Functions
```python
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
```
* **pe[:, 0::2] = torch.sin(position * div_term)**: This applies the sine function to even-indexed dimensions of **pe**.
* **pe[:, 1::2] = torch.cos(position * div_term)**: This applies the cosine function to odd-indexed dimensions of **pe**.
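
The product `position * div_term` relies on broadcasting: a `(max_len, 1)` column multiplied by a `(d_model/2,)` vector yields a `(max_len, d_model/2)` grid of angles, one per position and sine/cosine pair. Here is a small sketch with deliberately tiny sizes (`max_len = 5` and `d_model = 6` are arbitrary choices for readability).

```python
import math
import torch

max_len, d_model = 5, 6
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)               # (5, 1)
div_term = torch.exp(torch.arange(0, d_model, 2).float()
                     * (-math.log(10000.0) / d_model))                            # (3,)

angles = position * div_term      # broadcast: (5, 1) * (3,) -> (5, 3)
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(angles)   # fills columns 0, 2, 4
pe[:, 1::2] = torch.cos(angles)   # fills columns 1, 3, 5

print(angles.shape)   # torch.Size([5, 3])
print(pe[0])          # position 0: sin(0) = 0 in even columns, cos(0) = 1 in odd columns
```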

## 4. Reshaping and Registering the Buffer
```python
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)
```
* **pe = pe.unsqueeze(0).transpose(0, 1)**: This adds a new dimension to **pe** and then transposes it. The shape of **pe** becomes **(max_len, 1, d_model)**.
* **self.register_buffer('pe', pe)**: This registers **pe** as a buffer in the module. Buffers are tensors that are not considered model parameters but are part of the module's state.

## 5. Forward Method
```python
    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
```
* **def forward(self, x)**: The **forward** method defines the computation performed at every call of the module.
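* **x = x + self.pe[:x.size(0), :]**: This line adds the positional encoding to the input tensor **x**. The slice keeps the first **x.size(0)** positions, so **x** is expected in the shape **(seq_len, batch_size, d_model)**.
* **return self.dropout(x)**: This applies dropout to the output tensor and returns it.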

## Creating Synthetic Data and Running the Class

In [None]:
import torch
import torch.nn as nn
import numpy as np

# Positional Encoding for Transformer
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

# Synthetic data
batch_size = 2
seq_len = 10
d_model = 16

# Create a random tensor with the shape (seq_len, batch_size, d_model)
x = torch.randn(seq_len, batch_size, d_model)

In [None]:
print(x.shape)

torch.Size([10, 2, 16])


In [None]:
x[-1]

tensor([[-0.0671, -2.2780, -1.2814,  2.0654,  0.1273, -0.0603, -2.0497, -1.4655,
         -0.5372,  1.3063,  0.5705,  1.1857,  0.5660, -0.2755, -0.2692,  0.9791],
        [-1.5609, -0.3353,  0.5599, -1.2867, -0.6484, -2.7458,  0.5772, -1.5535,
         -0.4650, -0.3856, -0.1863, -0.3983,  0.9908, -2.0183, -1.0985, -0.9316]])

In [None]:
# Initialize the PositionalEncoding module
pos_encoder = PositionalEncoding(d_model)

# Pass the input tensor through the PositionalEncoding module
x_encoded = pos_encoder(x)

print(x_encoded.shape)

torch.Size([10, 2, 16])


In [None]:
x_encoded[-1]

tensor([[ 0.3834, -3.5435, -1.1001,  1.2319,  1.0118,  0.6236, -1.9655, -0.5619,
         -0.4970,  0.0000,  0.6655,  2.4281,  0.0000,  0.0000, -0.2959,  2.1990],
        [-1.2765, -1.3849,  0.9457, -2.4926,  0.1499, -2.3602,  0.9534, -0.6597,
         -0.0000,  0.6781, -0.1754,  0.6681,  1.1109, -1.1315, -1.2174,  0.0759]])
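
The exact zeros in the output above come from the dropout layer: a freshly created module is in training mode, so dropout zeroes a random subset of entries and scales the survivors by `1 / (1 - p)`. The sketch below isolates that behaviour on a tensor of ones; the tensor size and the seed are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.1)
values = torch.ones(1000)

drop.train()                          # training mode: dropout is active
out_train = drop(values)
print(out_train.unique())             # only 0 and 1 / (1 - 0.1) appear

drop.eval()                           # evaluation mode: dropout is a no-op
out_eval = drop(values)
print(torch.equal(out_eval, values))  # True
```

Calling `pos_encoder.eval()` before the forward pass would likewise return exactly `x` plus the positional encoding, with no zeroed entries.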

## About `0::2` in `pe[:, 0::2]`
The notation **`0::2`** is standard Python slice notation, which PyTorch tensors also support. It means "start at index 0, go to the end, and take every 2nd element".

### Understanding the Slice Notation `0::2`
* **0**: This is the start index. It tells the slice to start from the first element (index 0).
* **`:`**: The colons separate the start, stop, and step values. The stop value between the two colons is left empty here, which means "go up to the end".
* **2**: This is the step value. It means "take every 2nd element".

### Example
```python
list_example = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sliced_list = list_example[0::2]
print(sliced_list)
```
The output will be:
```python
[0, 2, 4, 6, 8]
```
This means it starts at index 0 and picks every 2nd element from the list.

### Applying to `pe[:, 0::2]`
Now, applying this to **`pe[:, 0::2]`**:
* **`pe[:, 0::2]`**:
    * **`:`** before the comma means all rows.
    * **`0::2`** after the comma means start at column 0 and take every 2nd column.

### Example in 2D Tensor
Consider a 2D tensor example to visualize this:

In [None]:
import torch

# Create a tensor with shape (5, 10)
tensor_example = torch.arange(50).reshape(5, 10)
print("Original Tensor:\n",tensor_example)

Original Tensor:
 tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]])


In [None]:
# Apply the slice
sliced_tensor = tensor_example[:, 0::2]
print("Sliced Tensor:\n", sliced_tensor)

Sliced Tensor:
 tensor([[ 0,  2,  4,  6,  8],
        [10, 12, 14, 16, 18],
        [20, 22, 24, 26, 28],
        [30, 32, 34, 36, 38],
        [40, 42, 44, 46, 48]])


As you can see, the sliced tensor contains every 2nd column from the original tensor.

### Applying to Positional Encoding
In the positional encoding context:
* **`pe[:, 0::2] = torch.sin(position * div_term)`**: This applies the sine function to the even-indexed columns of **pe** (0, 2, 4, ...).
* **`pe[:, 1::2] = torch.cos(position * div_term)`**: This applies the cosine function to the odd-indexed columns of **pe** (1, 3, 5, ...).

### Summary
* **`0::2`** means "start at index 0 and take every 2nd element until the end".
* In the context of **`pe[:, 0::2]`**, it selects the even-indexed dimensions (columns), which receive the sine values, while **`pe[:, 1::2]`** selects the odd-indexed dimensions (columns), which receive the cosine values.
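
As a quick check of which columns each slice selects, the plain-Python sketch below treats a list of column indices as one row of **pe**. The variable names are illustrative.

```python
row = list(range(10))      # column indices 0..9 of one row
even_cols = row[0::2]      # start at 0, step 2
odd_cols = row[1::2]       # start at 1, step 2

print(even_cols)  # [0, 2, 4, 6, 8]
print(odd_cols)   # [1, 3, 5, 7, 9]

# Together the two slices cover every column exactly once.
assert sorted(even_cols + odd_cols) == row
```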

## About `register_buffer`
**`register_buffer`** is a method in PyTorch used within **`nn.Module`** to register a tensor as a buffer. A buffer is a tensor that is not a model parameter but is still part of the module's state. These buffers are typically not updated during training (i.e., they do not require gradients), but they are useful for things like positional encodings, running statistics in batch normalization, or other fixed data that needs to be stored in the module.

### Key Points About `register_buffer`
1. **State Management**:
    * Buffers are part of the module's state. They are saved and loaded when you save or load the model's state dictionary. This makes them useful for storing data that is part of the model but should not be treated as a parameter.
2. **Fixed Data**:
    * Buffers are usually used for fixed data that does not require gradient updates, such as constants or other intermediate computations that should be preserved.
3. **Non-learnable Parameters**:
    * Unlike parameters registered using **register_parameter**, buffers do not require gradients and do not appear in the list of parameters returned by **parameters()**.
    



### Example Usage in Positional Encoding
In the **PositionalEncoding** class, **register_buffer** is used to store the positional encoding tensor **pe**:

In [None]:
d_model = 10
max_len = 5000

pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
print(pe.shape)

torch.Size([5000, 10])


In [None]:
pe = pe.unsqueeze(0).transpose(0, 1)
pe.shape

torch.Size([5000, 1, 10])

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

Here's what happens with **register_buffer** in this context:
* **Creating the Buffer**: **pe** is a tensor that contains the precomputed positional encodings. It is created and populated with values based on the sine and cosine functions.
* **Registering the Buffer**: **self.register_buffer('pe', pe)** registers **pe** as a buffer in the module. This means **pe** will be part of the module's state, saved and loaded with the model, but it will not be treated as a learnable parameter.
* **Accessing the Buffer**: In the **forward** method, the buffer **pe** can be accessed using **self.pe**, and it is added to the input tensor **x**.

### Why Use `register_buffer`?
Using **register_buffer** ensures that:
* The tensor is included in the module's state dictionary, which is important for saving and loading the model.
* The tensor is not treated as a parameter that requires gradient updates.
* The tensor can be moved to the appropriate device (CPU/GPU) along with the rest of the model using the **.to()** or **.cuda()** methods.
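
To see what registration buys, the sketch below compares a registered buffer with a tensor stored as an ordinary attribute. Both demo classes are hypothetical, and a dtype conversion stands in for a device move so that the example runs on a CPU-only machine; `.to()` treats buffers the same way in both cases.

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):          # hypothetical demo module
    def __init__(self):
        super().__init__()
        self.register_buffer('table', torch.ones(2))

class WithPlainAttribute(nn.Module):  # hypothetical demo module
    def __init__(self):
        super().__init__()
        self.table = torch.ones(2)    # ordinary attribute, not tracked by nn.Module

a, b = WithBuffer(), WithPlainAttribute()
print(list(a.state_dict().keys()))   # ['table']
print(list(b.state_dict().keys()))   # []

# .to() converts registered buffers along with parameters;
# plain tensor attributes are left untouched.
a.to(torch.float64)
b.to(torch.float64)
print(a.table.dtype)   # torch.float64
print(b.table.dtype)   # torch.float32
```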

### Practical Example

In [None]:
import torch
import torch.nn as nn

class ExampleModule(nn.Module):
    def __init__(self):
        super(ExampleModule, self).__init__()
        self.param = nn.Parameter(torch.randn(3, 3))
        buffer = torch.ones(3, 3)
        self.register_buffer('buffer', buffer)

    def forward(self, x):
        return self.param + self.buffer


# Create an instance of the module
module = ExampleModule()

# Print the state dictionary
print("State Dict:", module.state_dict())

State Dict: OrderedDict([('param', tensor([[ 0.9268, -0.3773, -0.9999],
        [ 1.3651, -0.8092,  1.0600],
        [-0.1706,  0.2358, -1.5510]])), ('buffer', tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]))])


In [None]:
# Check if buffer is included in the state dict
assert 'buffer' in module.state_dict()

In [None]:
# Print the parameters of the module
print("Parameters:", list(module.parameters()))

Parameters: [Parameter containing:
tensor([[ 0.9268, -0.3773, -0.9999],
        [ 1.3651, -0.8092,  1.0600],
        [-0.1706,  0.2358, -1.5510]], requires_grad=True)]


In [None]:
# Print the buffer of the module
print("Buffer:", list(module.buffers()))

Buffer: [tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])]


In this example:
* **self.param** is a learnable parameter registered using **nn.Parameter**.
* **buffer** is a fixed tensor registered using **register_buffer**.

When you print the state dictionary of the module, both **param** and **buffer** will be included. However, **buffer** will not be listed as a parameter that requires gradients.