# Study LSTM Input structure

# Basic Structure of LSTM Input
The input to an LSTM model typically has three dimensions:
1. **Sequence Length**: This dimension represents the length of the time series or sequence for each example. In other words, it's the number of time steps that the LSTM will unroll over or process. For instance, if you are analyzing daily stock prices for a month, the sequence length could be 30 (days).
2. **Batch Size**: This dimension represents the number of sequences processed in a single batch. In machine learning, it's common to process data in batches for efficiency. Each batch is a collection of sequences that the model will process simultaneously.
3. **Features**: This dimension represents the number of features (or input variables) available at each time step of each sequence. If you're predicting electricity consumption based on temperature and humidity at each time step, you would have 2 features.


## LSTM model definition

CLASS **torch.nn.LSTM(input_size, hidden_size, num_layers=1, bias=True, batch_first=False, dropout=0.0, bidirectional=False, proj_size=0, device=None, dtype=None)**
<br><br>
* Parameters
    * input_size – The number of **expected features** in the input x
    * hidden_size – The number of features in the hidden state h
    * num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
    * bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
    * batch_first – If True, then the input and output tensors are provided as (batch, seq, feature) instead of (seq, batch, feature). Note that this does not apply to hidden or cell states. See the Inputs/Outputs sections below for details. Default: False
    * dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
    * bidirectional – If True, becomes a bidirectional LSTM. Default: False
    * proj_size – If > 0, will use LSTM with projections of corresponding size. Default: 0

## Input and Output structure of LSTM
* Inputs: input, (h_0, c_0)
    * **input**: tensor of shape **(L, H_in)** for unbatched input, **(L, N, H_in)** when `batch_first=False` or **(N, L, H_in)** when `batch_first=True` containing features of the input sequence. The input can also be a packed variable length sequence.
    * **h_0**: tensor of shape (D * num_layers, H_out) for unbatched input or (D * num_layers, N, H_out) containing the initial hidden state for each element in the input sequence. Defaults to zeros if (h_0, c_0) is not provided.
    * **c_0**: tensor of shape (D * num_layers, H_cell) for unbatched input or (D * num_layers, N, H_cell) containing the initial cell state for each element in the input sequence. Defaults to zeros if (h_0, c_0) is not provided.
    * where:
        * N = batch size
        * L = sequence length
        * D = 2 if bidirectional = True otherwise 1
        * H_in = input_size (features nums)
        * H_cell = hidden_size
        * H_out = proj_size if proj_size > 0 otherwise hidden_size
* Outputs: output, (h_n, c_n)
    * **output**: tensor of shape **(L, D * H_out)** for unbatched input, **(L, N, D * H_out)** when `batch_first=False` or **(N, L, D * H_out)** when `batch_first=True` containing the output features (h_t) from the last layer of the LSTM, for each t. If a `torch.nn.utils.rnn.PackedSequence` has been given as the input, the output will also be a packed sequence. When `bidirectional=True`, output will contain a concatenation of the forward and reverse hidden states at each time step in the sequence.
    * **h_n**: tensor of shape **(D * num_layers, H_out)** for unbatched input or **(D * num_layers, N, H_out)** containing the final hidden state for each element in the sequence. When `bidirectional=True`, h_n will contain a concatenation of the final forward and reverse hidden states, respectively.
    * **c_n**: tensor of shape **(D * num_layers, H_cell)** for unbatched input or **(D * num_layers, N, H_cell)** containing the final cell state for each element in the sequence. When `bidirectional=True`, c_n will contain a concatenation of the final forward and reverse cell states, respectively.
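
The shape rules above can be checked with a small sketch. The sizes below (L=30, N=4, H_in=5, hidden_size=16, two bidirectional layers) are arbitrary values chosen for illustration, not values from these notes.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions made for this sketch)
L, N, H_in, hidden_size, num_layers = 30, 4, 5, 16, 2

# batch_first is left at its default (False), so the layout is (L, N, H_in)
lstm = nn.LSTM(input_size=H_in, hidden_size=hidden_size,
               num_layers=num_layers, bidirectional=True)
x = torch.randn(L, N, H_in)

# (h_0, c_0) is omitted, so both default to zeros
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (L, N, D * H_out)         -> torch.Size([30, 4, 32])
print(h_n.shape)     # (D * num_layers, N, H_out) -> torch.Size([4, 4, 16])
print(c_n.shape)     # (D * num_layers, N, H_cell) -> torch.Size([4, 4, 16])
```

Here D = 2 because `bidirectional=True`, and H_out equals `hidden_size` because `proj_size` is 0.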


# Example 1
Suppose you're working with a dataset where you have 100 sequences (e.g., 100 different stocks), each sequence consists of 30 time steps (e.g., a month of daily observations), and at each time step, you have 5 features (e.g., opening price, closing price, volume, high, low).
<br><br>
The input to your LSTM would then need to be reshaped or structured as follows:
* **Sequence Length**: 30 (time steps)
* **Batch Size**: 100 (sequences)
* **Features**: 5 (features per time step)

## In PyTorch
When defining an LSTM in PyTorch, you would typically specify the `input_size` and `hidden_size` parameters in the LSTM layer. <b><u>The `input_size` corresponds to the number of features per time step</u></b> (the third dimension in the input structure), and `hidden_size` corresponds to the number of features in the hidden state `h`.
<br><br>
Here's how you might define such an LSTM in PyTorch:

In [None]:
import torch.nn as nn

# Assuming 5 features per time step and a hidden layer size of 50
lstm_layer = nn.LSTM(input_size=5, hidden_size=50, batch_first=True)

Note that setting `batch_first=True` in PyTorch's LSTM layer means that the input and output tensors are provided as <b>`(batch, seq, feature)`</b> which aligns with the intuitive understanding of batch size, sequence length, and features per time step.
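
If the data is already stored as `(seq, batch, feature)`, it has to be transposed, not reshaped, to get the `batch_first=True` layout. A minimal sketch using the sizes from Example 1 (the variable names are made up for illustration):

```python
import torch

seq_first = torch.randn(30, 100, 5)        # (seq, batch, feature)

# Swap the first two axes: every (time step, sequence) pair keeps its features
batch_first = seq_first.transpose(0, 1)    # (batch, seq, feature)
print(batch_first.shape)                   # torch.Size([100, 30, 5])

# Time step 7 of sequence 3 is the same vector in both layouts
same_vector = torch.equal(batch_first[3, 7], seq_first[7, 3])
print(same_vector)                         # True

# reshape() produces the right shape but scrambles which values belong together
wrong = seq_first.reshape(100, 30, 5)
print(torch.equal(wrong, batch_first))     # False
```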

## Handling Variable Sequence Length
In practice, you might encounter datasets where sequences have variable lengths. Handling such cases efficiently requires additional steps, such as padding shorter sequences to match the longest one in a batch or using PyTorch's `pack_padded_sequence` and `pad_packed_sequence` utilities to handle variable lengths without unnecessary computation.
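
A minimal sketch of the padding and packing workflow, assuming three made-up sequences of lengths 4, 2 and 3 with 5 features each and a hidden size of 8:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences with different lengths (illustrative data)
seqs = [torch.randn(4, 5), torch.randn(2, 5), torch.randn(3, 5)]
lengths = torch.tensor([len(s) for s in seqs])

# Pad with zeros to the longest sequence: (batch=3, max_len=4, features=5)
padded = pad_sequence(seqs, batch_first=True)

# Pack so the LSTM skips the padded positions; enforce_sorted=False
# lets the batch stay in its original (unsorted) order
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor plus the true lengths
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)      # torch.Size([3, 4, 8])
print(out_lengths)    # tensor([4, 2, 3])
```

With packed input, `h_n` holds the hidden state at each sequence's true last step (for the second sequence, step index 1), and the padded positions of `out` are filled with zeros.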

## Data Structure when calling forward() method
When you use an LSTM layer in PyTorch with `batch_first=True`, your data needs to be structured so that the batch dimension comes first. This means that your input tensor should be shaped as <b>(batch_size, seq_len, input_size)</b>, where
<br><br>
* **batch_size** is the number of sequences in each batch.
* **seq_len** is the number of time steps in each sequence.
* **input_size** is the number of features per time step (5 in your example).

<br><br>
Given the **lstm_layer** you've defined (**nn.LSTM(input_size=5, hidden_size=50, batch_first=True)**),
let's say you want to process a batch of 10 sequences (batch_size=10), each sequence being of length 30 time steps (seq_len=30), with 5 features at each time step (input_size=5).
<br><br>
Here's how you would prepare your data and call the `forward()` method of your LSTM layer:


### Step1: Prepare Your Data
Ensure your input data tensor has the shape **(batch_size, seq_len, input_size)**. For demonstration, we'll create a dummy tensor matching these specifications:

In [None]:
import torch.nn as nn
import torch

# Assuming 5 features per time step and a hidden layer size of 50
lstm_layer = nn.LSTM(input_size=5, hidden_size=50, batch_first=True)

# Create a dummy input tensor of shape (batch_size, seq_len, input_size)
batch_size = 10
seq_len = 30
input_size = 5 # Number of features

# If your data is in a NumPy array or a list of lists, you would
# convert it with torch.tensor(...); random data is used here instead
input_tensor = torch.randn(batch_size, seq_len, input_size) # Random data for demonstration

In [None]:
input_tensor.shape
# (batch_size, seq_len, input_size)

torch.Size([10, 30, 5])

### Step2: Initialize the Hidden State (Optional)
Initializing the hidden state is optional, as PyTorch automatically initializes it to zeros if it is not provided. However, if you want to initialize it manually, you need to provide both the hidden state `h_0` and the cell state `c_0` tensors with the shape **(num_layers * num_directions, batch_size, hidden_size)**. For a single-layer LSTM that is not bidirectional, this simplifies to **(1, batch_size, hidden_size)**:

In [None]:
hidden_size = 50 # As defined in your lstm_layer
num_layers = 1 # Assuming a single layer LSTM
num_directions = 1 # Assuming a unidirectional LSTM

h_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)
c_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)

print(h_0.shape)
print(c_0.shape)

torch.Size([1, 10, 50])
torch.Size([1, 10, 50])


### Step3: Call the **`forward()`** Method
Now, you're ready to pass your input tensor (and optionally the initial hidden and cell states) through the LSTM layer:

In [None]:
# lstm_layer is the LSTM layer defined in Step1
# input_tensor.shape -> (batch_size, seq_len, input_size)
out, (h_n, c_n) = lstm_layer(input_tensor, (h_0, c_0))

In [None]:
out.shape
# (batch_size, seq_length, hidden_size)

torch.Size([10, 30, 50])

In [None]:
out[-1].shape

torch.Size([30, 50])

In [None]:
out[:, -1, :].shape

torch.Size([10, 50])

In [None]:
print(h_n.shape)
print(c_n.shape)
# (num_layers * num_directions, batch_size, hidden_size)

torch.Size([1, 10, 50])
torch.Size([1, 10, 50])


In [None]:
c_n[-1].shape

torch.Size([10, 50])

Or, if you're letting PyTorch initialize the hidden state automatically:

In [None]:
out, (h_n, c_n) = lstm_layer(input_tensor)

Here, **`out`** contains the output features from the LSTM layer for each time step; **`h_n`** is the final hidden state, and **`c_n`** is the final cell state. The shapes of these outputs are:

* **out**: **(batch_size, seq_len, hidden_size)** - Output features from the LSTM for each time step.
* **h_n, c_n**: **(num_layers * num_directions, batch_size, hidden_size)** - Final hidden and cell state.


This structure allows the LSTM to process batches of sequences, making training more efficient and facilitating the handling of data with a temporal dimension.

# Example 2
The following walkthrough (from a GPT-4 answer) creates a dummy dataset to predict stock prices using PyTorch, including defining a dataset, using DataLoader, and setting up a simple LSTM model training loop. The example uses the past 7 days of data (sequence length = 7) with 5 features (high price, low price, volume, GDP, CPI) to predict the stock price.

## Step1: Imports and Setup

In [None]:
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import Dataset, DataLoader

# Setting a random seed for reproducibility
torch.manual_seed(0)
np.random.seed(0)


## Step2: Create Dummy Data

In [None]:
# Parameters
num_samples = 100  # Number of total samples in the dataset
seq_length = 7    # Past 7 days data
num_features = 5  # high price, low price, volume, GDP, CPI

# Generate synthetic features
features = np.random.rand(num_samples, num_features).astype(np.float32)

# Generate synthetic stock prices (targets), just for the example
targets = np.random.rand(num_samples, 1).astype(np.float32)

print("Features shape:", features.shape)
print("Targets shape:", targets.shape)


Features shape: (100, 5)
Targets shape: (100, 1)


In [None]:
features[:3]

array([[0.5488135 , 0.71518934, 0.60276335, 0.5448832 , 0.4236548 ],
       [0.6458941 , 0.4375872 , 0.891773  , 0.96366274, 0.3834415 ],
       [0.79172504, 0.5288949 , 0.56804454, 0.92559665, 0.07103606]],
      dtype=float32)

In [None]:
targets[:3]

array([[0.31038082],
       [0.37303486],
       [0.5249705 ]], dtype=float32)

## Step3: Define a Custom Dataset
* use **torch.utils.data.Dataset**

In [None]:
class StockDataset(Dataset):
    def __init__(self, features, targets, seq_length):
        self.features = features
        self.targets = targets
        self.seq_length = seq_length

    def __len__(self):
        return len(self.features) - self.seq_length

    def __getitem__(self, idx):
        # make sequence data here
        return self.features[idx:idx+self.seq_length], self.targets[idx+self.seq_length]

# Create the dataset(x, y, sequence_length)
dataset = StockDataset(features, targets, seq_length)

print("Dataset size: ", len(dataset))

Dataset size:  93
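
To see which rows each dataset item contains, the same slicing can be tried on easily readable data. This is a sketch only: `get_window` is a hypothetical helper that mirrors `StockDataset.__getitem__`, and the `arange` data replaces the random data so that row numbers are visible.

```python
import numpy as np

num_samples, seq_length, num_features = 100, 7, 5

# Row i of `targets` is simply [i], which makes the window boundaries easy to read
features = np.arange(num_samples * num_features, dtype=np.float32).reshape(num_samples, num_features)
targets = np.arange(num_samples, dtype=np.float32).reshape(num_samples, 1)

def get_window(idx):
    # Same slicing as StockDataset.__getitem__: rows idx..idx+6 as input,
    # row idx+7 as the target
    return features[idx:idx + seq_length], targets[idx + seq_length]

x0, y0 = get_window(0)
x_last, y_last = get_window(num_samples - seq_length - 1)  # last valid index: 92

print(x0.shape, y0)           # (7, 5) [7.]
print(x_last.shape, y_last)   # (7, 5) [99.]
print(num_samples - seq_length)  # 93 windows in total
```

The first window uses rows 0 to 6 to predict row 7, and the last one uses rows 92 to 98 to predict row 99, which is why the dataset length is 100 - 7 = 93.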


## Step4: DataLoader
* Utilize DataLoader for batching

In [None]:
batch_size = 20 # 20 samples per batch

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

print("Features shape:", features.shape)
print("Targets shape:", targets.shape)

Features shape: (100, 5)
Targets shape: (100, 1)


### about `drop_last=True` in DataLoader
 * The **drop_last=True** argument in PyTorch's **DataLoader** constructor specifies how the DataLoader handles the last batch when the size of your dataset is not perfectly divisible by the batch size. When set to **True**, the DataLoader drops the last batch if its size is smaller than the specified batch size. Conversely, if it's set to **False** (the default value), the DataLoader includes the last batch, even if it contains fewer elements than the batch size.

#### Why Use `drop_last=True`
* **Model Training Stability**: Some models, especially those that involve certain types of normalization (like Batch Normalization), can behave unpredictably when the batch size is smaller than expected. Dropping the last incomplete batch ensures that all batches fed into the model during training have the same size, contributing to more stable training dynamics.
* **Consistency in Batch Size for RNNs/LSTMs**: In sequence processing tasks (like the one with LSTMs you're working on), a consistent number of sequences per batch matters when you create `h_0` and `c_0` manually with a fixed batch size or carry the hidden state from one batch to the next. It does not change the sequence length, which is set by the dataset.
* **GPU Memory Optimization**: By maintaining consistent batch sizes, you can optimize your GPU memory usage, as varying batch sizes can lead to inefficient memory utilization.


### Considerations:
* **Data Utilization**: Setting `drop_last=True` means you might not use every data point in your dataset for training, as the last few samples might be ignored if they don't fill up a batch. In scenarios where every data point is valuable, and you want to maximize data utilization, you might prefer setting `drop_last=False`.
* **Impact on Small Datasets**: For small datasets, dropping the last batch might lead to losing a significant portion of valuable training data, which could potentially affect model performance.


In summary, the choice of setting `drop_last` to `True` or `False` depends on the specifics of your training setup, model architecture, and how critical it is to maintain consistent batch sizes throughout training.
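
As a quick check of the effect, the sketch below counts the batches for 93 samples and a batch size of 20, the sizes used in this example. Random tensors in a `TensorDataset` stand in for the real dataset.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 93 samples of shape (7, 5) with one target each (random stand-in data)
data = TensorDataset(torch.randn(93, 7, 5), torch.randn(93, 1))

results = {}
for drop_last in (True, False):
    loader = DataLoader(data, batch_size=20, drop_last=drop_last)
    # Record the number of sequences in every batch
    results[drop_last] = [x.shape[0] for x, _ in loader]
    print(drop_last, len(loader), results[drop_last])

# True 4 [20, 20, 20, 20]
# False 5 [20, 20, 20, 20, 13]
```

With `drop_last=True`, the 13 leftover samples are skipped in each epoch; when `shuffle=True` is also set, a different set of 13 is skipped every time.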

## Step5: Define the LSTM Model

In [None]:
class StockPricePredictor(nn.Module):
    def __init__(self, num_features, hidden_size, num_layers=1):
        super(StockPricePredictor, self).__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1) # Predicting one stock price

    def forward(self, x):
        _, (hn, _) = self.lstm(x)
        # hn.shape -> (num_layers * num_directions, batch_size, hidden_size)
        out = self.linear(hn[-1]) # <- last entry along the num_layers*num_directions axis, shape (batch_size, hidden_size)
        return out

* Note: why does the forward step use `hn[-1]` instead of the `output` returned by `self.lstm`?
    * -> See the Tips2 section below.

## Step6: Training Loop

In [None]:
# Model instantiation
model = StockPricePredictor(num_features=num_features, hidden_size=50)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 100

for epoch in range(epochs):
    for features, targets in dataloader:
        # Forward pass
        outputs = model(features)
        loss = criterion(outputs, targets)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Print loss every 10 epochs
    if (epoch+1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}")

Epoch 10/100, Loss: 0.0875
Epoch 20/100, Loss: 0.0974
Epoch 30/100, Loss: 0.0799
Epoch 40/100, Loss: 0.0941
Epoch 50/100, Loss: 0.0750
Epoch 60/100, Loss: 0.0653
Epoch 70/100, Loss: 0.0809
Epoch 80/100, Loss: 0.0639
Epoch 90/100, Loss: 0.0717
Epoch 100/100, Loss: 0.0898


In [None]:
outputs.shape

torch.Size([20, 1])

In [None]:
targets.shape

torch.Size([20, 1])

### Notes:
* **Data Structure**: Throughout this process, particularly in the DataLoader and LSTM model, the data structure follows the **(batch_size, seq_length, num_features)** format for inputs. The targets in this example are simplified to a single value per sequence, hence the shape **(batch_size, 1)** for each batch of targets.
* **Custom Dataset**: The custom dataset prepares sequences of the specified length **(seq_length)** with corresponding targets. This aligns with how LSTM expects sequential data.
* **Model Training**: This is a basic training loop without validation steps or detailed performance evaluation for simplicity.


## Tips1: About Output Data Structure
In the provided LSTM model example, the output from the **model(features)** call, represented by **outputs**, is generated by passing the input batch through the LSTM layer followed by a linear layer. The final output structure is directly influenced by how the model is defined, especially the output feature size of the linear layer.
<br><br>
Given the model definition:

In [None]:
class StockPricePredictor(nn.Module):
    def __init__(self, num_features, hidden_size, num_layers=1):
        super(StockPricePredictor, self).__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1) # Predicting one stock price

    def forward(self, x):
        _, (hn, _) = self.lstm(x)
        out = self.linear(hn[-1]) # <- last entry along the num_layers*num_directions axis, shape (batch_size, hidden_size)
        return out

* The prediction, in this case, is computed from **hn[-1]**. **hn** holds the final hidden state (the state after the last time step) of every LSTM layer, and **hn[-1]** selects the last layer. This is passed to a linear layer to predict a single value (the stock price).
* The linear layer's output size is set to 1, as we are predicting a single value per sequence.

### Output Data Structure:
* **Shape**: Since the model's output comes from a linear layer predicting a single value, and because we're processing batches of sequences, the shape of **outputs** will be **[batch_size, 1]**.
* **Meaning**: Each item in **outputs** corresponds to the model's prediction for a stock price at the end of each sequence in the batch.


For a batch size of 20, as set in your DataLoader, the shape of **outputs** would be **[20, 1]**. Each of the 20 elements in this tensor is the predicted stock price derived from the respective sequence of 7 days' worth of data with 5 features (high price, low price, volume, GDP, CPI) fed into the model.
<br><br>
This output structure makes it straightforward to compare the model's predictions against the actual stock prices (the targets) for each sequence in the batch, which is essential for computing the loss during training and evaluating the model's performance.

## Tips2: using **`hidden_state`** and **`output`** in LSTM's outputs

In [None]:
### using last hidden_state of LSTM
def forward(self, x):
    _, (hn, _) = self.lstm(x)
    out = self.linear(hn[-1])
    return out

# -------------------------------------------------------------------------------

### using last output of LSTM
def forward(self, x):
    output, (hn, _) = self.lstm(x)
    out = self.linear(output[:, -1, :]) # Select the output of the last time step
    return out

The variable **hn** represents the hidden state of the LSTM layer at the last time step of the sequence. It is not the predicted output value directly but rather a <u>**representation of the information learned from the entire sequence up to that point. The prediction is made based on this information**</u>.

Here's a breakdown of the process
1. **LSTM Processing**: When input sequences are fed into an LSTM, it processes them one time step at a time, updating its hidden state (**hn**) and cell state (**cn**) at each step. This hidden state serves as a kind of "memory" of the sequence processed so far, incorporating information from both recent and, potentially, long-past time steps in the sequence.
2. **Hidden State as a Feature Vector**: By the time the last time step of a sequence is processed, the LSTM's hidden state (**hn**) contains a compact representation of the entire sequence's information. This makes it a powerful feature vector for making predictions about the sequence.
3. **Prediction from Hidden State**: In many LSTM applications, especially those involving sequence-to-value predictions (like stock price forecasting, where you predict a single value from a sequence of values), <u>the final hidden state is used as the input to a fully connected (linear) layer</u> to make the actual prediction. The linear layer maps the information encoded in the hidden state to the desired output format (e.g., a single stock price).
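
The step-by-step update described above can be made explicit with `nn.LSTMCell`, which processes one time step per call. The sketch below copies the weights of a single-layer `nn.LSTM` into a cell and loops over the time steps by hand. The sizes are arbitrary, and the comparison assumes both modules run on the CPU in float32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
cell = nn.LSTMCell(input_size=5, hidden_size=8)

# Copy the layer-0 parameters so both modules compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(3, 7, 5)   # (batch_size, seq_len, features)
h = torch.zeros(3, 8)      # initial hidden state
c = torch.zeros(3, 8)      # initial cell state

with torch.no_grad():
    # One update of (h, c) per time step
    for t in range(x.size(1)):
        h, c = cell(x[:, t, :], (h, c))
    _, (h_n, c_n) = lstm(x)

# After the last step, the hand-rolled state should match nn.LSTM's final state
print(torch.allclose(h, h_n[-1], atol=1e-5))
print(torch.allclose(c, c_n[-1], atol=1e-5))
```

The `h` left at the end of the loop is the feature vector that gets passed to the linear layer in the models in these notes.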


In the provided code snippet:

In [None]:
def forward(self, x):
    _, (hn, _) = self.lstm(x)
    out = self.linear(hn[-1])
    return out

* The LSTM processes the input **x**, updating its hidden state **hn** at each time step.
* After processing the entire sequence, the final hidden state(**hn[-1]**) is passed to a linear layer.
* The linear layer(**self.linear**) transforms the hidden state into the predicted output(**out**). This step effectively maps the sequence-encoded information in the hidden state to a specific prediction task, such as predicting a stock price.

This approach leverages the LSTM's ability to capture and summarize temporal patterns and relationships in the sequence data, using the final hidden state as a comprehensive feature representation for prediction.

### about using output instead of **`hn`**

In [None]:
    def forward(self, x):
        output, _ = self.lstm(x)  # Get the output for all time steps
        last_time_step_output = output[:, -1, :]  # Select the output of the last time step (last sequence data)
        out = self.linear(last_time_step_output)  # Feed it into the linear layer
        return out

Using the output directly, as in the modified `forward` method shown above, is also a valid and common approach in sequence processing models, including LSTMs. The key difference lies in what **output** and **hn** (hidden state) represent and how they influence the model's predictions.

#### LSTM Outputs Explained:
* **Hidden State (`hn`)**: This is the final hidden state of the LSTM for each layer (if you have multiple layers) after processing the last time step of the input sequence. It represents the LSTM's "memory" of the entire sequence, capturing information from all the time steps.
* **Output (`output`)**: This tensor contains the outputs of the LSTM at each time step for the last layer. For many-to-one and many-to-many tasks, this output can be used directly for further processing or predictions. Each element of this tensor is the hidden state at a specific time step, reflecting the information available up to that point in the sequence.

### Impact on Predictions:
* **Using the Final Hidden State (`hn[-1]`)**: When you use the final hidden state as in your original code, the prediction is based solely on the summary of the entire sequence captured by the LSTM's internal memory at the last time step. This approach is well-suited for tasks where the prediction depends on the entire sequence, and you need a compact representation of all the input data.
* **Using the Last Output (`output[:, -1, :]`)**: In the modified approach, using **output[:, -1, :]** means you are basing your prediction on the LSTM output of the last time step directly. Since **output[:, -1, :]** and **hn[-1]** hold the same values for a single-layer, unidirectional LSTM in a many-to-one task (where the objective is to make a single prediction at the end of the sequence), this change does not alter the essence of the prediction. It's essentially another way to access the last hidden state for the final layer of the LSTM.

### Accuracy and Suitability:
* **No Inherent Accuracy Difference**: For a single-layer LSTM in a many-to-one prediction task, there's no inherent accuracy difference between using **hn[-1]** and **output[:, -1, :]** because they represent the same information. The choice between them is more about the specific requirements of your task and the architecture of your model.
* **Flexibility in Model Design**: Using **output** instead of **hn** provides flexibility, especially in many-to-many tasks where you might need access to the LSTM's outputs at each time step, not just the final one. For example, if you later decide to predict stock prices at multiple future time points instead of just the next one, **output** gives you the LSTM's hidden state at every time step, which can be used for sequence-to-sequence predictions.

<br><br>
In summary, your modified approach using **output[:, -1, :]** is accurate and common, especially in single-layer LSTMs focused on many-to-one prediction tasks. The choice between using **output[:, -1, :]** and **hn[-1]** depends on your specific task requirements and whether your model could benefit from the additional flexibility offered by accessing the full sequence of outputs.
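
The equivalence can be verified numerically. The sketch below uses the sizes from Example 1 (batch of 10, 30 time steps, 5 features, hidden size 50). It also shows a bidirectional LSTM as an additional case that these notes do not otherwise cover, where the two expressions no longer match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(10, 30, 5)   # (batch_size, seq_len, features)

# Single-layer, unidirectional: both expressions return the same values
uni = nn.LSTM(5, 50, batch_first=True)
out, (h_n, _) = uni(x)
uni_same = torch.allclose(out[:, -1, :], h_n[-1])
print(uni_same)              # True

# Bidirectional: out has 2 * 50 features per step (forward | backward)
bi = nn.LSTM(5, 50, batch_first=True, bidirectional=True)
out_bi, (h_bi, _) = bi(x)

# The forward direction finishes at the last time step ...
fwd_same = torch.allclose(out_bi[:, -1, :50], h_bi[-2])
# ... while the backward direction finishes at the first time step
bwd_same = torch.allclose(out_bi[:, 0, 50:], h_bi[-1])
print(fwd_same, bwd_same)    # True True
```

So for a bidirectional model, `out[:, -1, :]` contains the backward direction's state after it has seen only one time step, while `h_n` holds the fully processed state of each direction.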

# difference between **`output[-1]`** and **`output[:, -1, :]`** in forward pass

The distinction between using **output[-1]** and **output[:, -1, :]** in the **forward** method of your model when feeding into a **linear** layer revolves around how PyTorch interprets indexing and the batch_first parameter of the LSTM.

## Using **`output[:, -1, :]`**
example
* output.shape -> Size(batch_size, seq_length, hidden_size)
* output.shape ->(10, 30, 50)
    * output[:, -1, :] -> (10, 50)
        - get last records of seq_length

* **Assumption**: The LSTM is defined with **batch_first=True**. This means the output tensor **output** from the LSTM has a shape of **(batch_size, seq_len, hidden_size)**.
* **Effect**: When you use **output[:, -1, :]**, you're selecting the last time step's output for every sequence in the batch, keeping the batch size intact. The result is a tensor of shape **(batch_size, hidden_size)**, which is what you want to feed into your linear layer if it's set up to accept **hidden_size** features as input.

## Using **`output[-1]`**
**Note: this is the misuse case.**

example
* output.shape -> Size(batch_size, seq_length, hidden_size)
* output.shape ->(10, 30, 50)
    * output[-1] -> (30, 50)
        - get last records of batch

* **Potential Misinterpretation**: **output[-1]** selects the last element along the first dimension of **output**. With the default LSTM output shape **(seq_length, batch_size, hidden_size)** (`batch_first=False`), that is the last time step's output for all sequences in the batch, with shape **(batch_size, hidden_size)**, which is why the expression looks correct at first glance.
* **Actual Outcome**: With **batch_first=True**, the first dimension is the batch, so **output[-1]** returns every time step of the last sequence in the batch, with shape **(seq_length, hidden_size)**. The linear layer still accepts this tensor because its last dimension is **hidden_size**, so no error is raised; the model simply produces **(seq_length, 1)** predictions from the wrong slice, and the mismatch only surfaces later (for example as a shape mismatch or broadcasting warning in the loss).

## Correct Approach for the Given Example
For the provided LSTM configuration **(batch_first=True)**, to correctly <u>select the last time step's output across the entire batch</u> and maintain consistency with the expected input shape for the linear layer, you should use **output[:, -1, :]**. This approach ensures that the tensor fed into the **linear** layer has the correct shape **(batch_size, hidden_size)**.


Thus, the second example, which attempts to use **output[-1]**, does not perform as intended because its indexing does not account for **batch_first=True**. It's critical to ensure the indexing matches the data's actual structure and the model's configuration to avoid errors and ensure that the model learns from the correct features.
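
The two indexing styles can be compared directly. A minimal sketch with the shapes from the example above (10 sequences, 30 time steps, hidden size 50); the variable names `good` and `bad` are made up for illustration:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=5, hidden_size=50, batch_first=True)
linear = nn.Linear(50, 1)

x = torch.randn(10, 30, 5)        # (batch_size, seq_len, features)
output, _ = lstm(x)               # (10, 30, 50)

# Last time step of every sequence: one prediction per sequence
good = linear(output[:, -1, :])
print(good.shape)                 # torch.Size([10, 1])

# All time steps of the last sequence: no error, but 30 "predictions"
bad = linear(output[-1])
print(bad.shape)                  # torch.Size([30, 1])
```

Because `nn.Linear` only checks the last dimension, printing the shape right after the indexing step is a simple way to catch this mistake.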

# no specific sequence length in LSTM of Keras
In Keras, if you do not specify the **batch_size** and **input_length** in the **LSTM** layer, the model is able to take input of any length. The sequence length is then left as None (undefined), which allows the model to process input sequences of varying lengths.


In PyTorch, LSTM layers can also accept sequences of any length, but it is handled a bit differently. By default (`batch_first=False`), PyTorch's LSTM expects the input data to be a tensor of shape **(seq_len, batch, input_size)** where **seq_len** is the length of the sequence, **batch** is the batch size, and **input_size** is the number of features in the input. Only **input_size** is fixed when the layer is created, so **seq_len** can vary from batch to batch if needed.
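
A short sketch of this behavior: the same layer is called with three different sequence lengths. The lengths and sizes are arbitrary, and `batch_first=True` is used to match the earlier examples.

```python
import torch
import torch.nn as nn

# Only input_size and hidden_size are fixed when the layer is created
lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)

shapes = []
for seq_len in (3, 7, 20):
    out, (h_n, _) = lstm(torch.randn(4, seq_len, 5))
    shapes.append((tuple(out.shape), tuple(h_n.shape)))
    print(out.shape, h_n.shape)

# out grows with seq_len; h_n keeps the shape (1, 4, 8) every time
```

Within a single batch, all sequences still need the same length (or padding and packing).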

# Flatten() to output from LSTM
**Flatten()** transforms the LSTM output from 3D to 2D. The 2D result keeps the data from every time step: **(batch_size, seq_len, num_hidden_units) -> Flatten() -> (batch_size, seq_len * num_hidden_units)**

In [None]:
# nn.Flatten() edition
class PortfolioModel(nn.Module):
    # seq_len and hidden_size are extra arguments: the linear layer needs
    # seq_len * hidden_size input features after flattening
    def __init__(self, input_size, outputs, seq_len, hidden_size=64):
        super(PortfolioModel, self).__init__()
        # batch_first=True so the LSTM output is (batch_size, seq_len, hidden_size)
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(seq_len * hidden_size, outputs)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        flattened = self.flatten(lstm_out)  # (batch_size, seq_len * hidden_size)
        output = self.fc(flattened)
        return self.softmax(output)

#-------------------------------------------------
# Equivalent forward() that reshapes by hand
# instead of using nn.Flatten()
def forward(self, x):
    lstm_out, _ = self.lstm(x)
    # Reshape from (batch_size, seq_len, hidden_size) to (batch_size, seq_len * hidden_size)
    lstm_out = lstm_out.reshape(x.size(0), -1)
    output = self.fc(lstm_out)
    return self.softmax(output)

In [None]:
# use last sequence data edition
class PortfolioModel(nn.Module):
    def __init__(self, input_size, outputs):
        super(PortfolioModel, self).__init__()
        # batch_first=True so lstm_out is (batch_size, seq_len, hidden_size)
        self.lstm = nn.LSTM(input_size, 64, batch_first=True)
        self.fc = nn.Linear(64, outputs)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        output = self.fc(lstm_out[:, -1, :])  # last time step of every sequence
        return self.softmax(output)

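Note that with the default `batch_first=False`, the LSTM layer in PyTorch outputs a tensor with the shape **(seq_len, batch, num_features)**. **nn.Flatten()** keeps the first dimension and flattens the rest, so this output becomes **(seq_len, batch * num_features)**, which mixes different samples in one row. To get **(batch, seq_len * num_features)** for the fully connected layer, create the LSTM with `batch_first=True` (or move the batch axis to the front before flattening).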
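
The effect of the layout on **nn.Flatten()** can be checked on a dummy tensor. The sizes below (50 time steps, batch of 4, 8 hidden units) are arbitrary values for illustration.

```python
import torch
import torch.nn as nn

flatten = nn.Flatten()  # keeps dim 0, flattens everything after it

# Stand-in for an LSTM output in the default (seq_len, batch, hidden) layout
seq_first_out = torch.randn(50, 4, 8)
print(flatten(seq_first_out).shape)      # torch.Size([50, 32])  -> one row per time step

# The same data in (batch, seq_len, hidden) layout
batch_first_out = seq_first_out.transpose(0, 1)
flat = flatten(batch_first_out)
print(flat.shape)                        # torch.Size([4, 400])  -> one row per sample

# Each row is that sample's time steps laid end to end
print(torch.equal(flat[2, :8], seq_first_out[0, 2]))  # True
```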


## Using whole sequence data from LSTM by flatten(), instead of out[:, -1, :]

To modify the LSTM model in PyTorch so that the entire sequence output from the LSTM is used for the linear transformation, rather than just the final time step, you will need to adjust how you handle the output shape from the LSTM. Since the LSTM's output **out** is of shape **[batch_size, sequence_length, hidden_size]** (with values **[64, 50, 64]** in this case), and you want the **`nn.Linear`** module to see every time step's output, you'll need to flatten the last two dimensions before passing them to the linear layer.


Here is how you can modify the model to incorporate this change using **`nn.Flatten`** or reshaping:

### Step-by-Step Modification
1. **Flatten the Sequence**: Reshape (flatten) the **sequence_length** and **hidden_size** dimensions into a single dimension that can be fed into the **`nn.Linear`** layer. This is done by reshaping the output of the LSTM.
2. **Modify the Linear Layer**: Ensure that the **`nn.Linear`** layer's input features match the product of **`sequence_length`** and **`hidden_size`**, since the hidden states of all time steps are concatenated into one feature vector per sample.


Here's the modified model code:

In [None]:
import torch
import torch.nn as nn

class LSTMSharpeRatio(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, sequence_length):
        super(LSTMSharpeRatio, self).__init__()
        # Use the constructor arguments instead of hard-coded sizes
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)
        # Adjust the number of input features to the linear layer
        self.fc = nn.Linear(sequence_length * hidden_size, output_size)
        self.activation = nn.Softmax(dim=1)

    def forward(self, x):
        # x shape: [batch_size, sequence_length, input_size]
        out, _ = self.lstm(x)
        # Reshape(Flatten) the output to [batch_size, sequence_length * hidden_size]
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return self.activation(out)


**Key Changes**

* **Reshape Operation**: **`out.reshape(out.shape[0], -1)`** flattens the LSTM output from **`[64, 50, 64]`** to **`[64, 3200]`** (since 50*64=3200), where 3200 is the new feature size per batch element that goes into the linear layer.
* **Adjustment in `nn.Linear`**: The **`nn.Linear`** layer is set to take **`hidden_size * sequence_length`** as the input feature size to accommodate the flattened output from all time steps.
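
The shapes quoted above can be checked with a small sketch. It assumes the sizes used in these notes (batch 64, sequence length 50, 8 input features, 64 hidden units); the 4 output units are an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

batch_size, sequence_length, input_size, hidden_size = 64, 50, 8, 64

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
x = torch.randn(batch_size, sequence_length, input_size)

out, _ = lstm(x)
flat = out.reshape(out.shape[0], -1)
print(out.shape)    # torch.Size([64, 50, 64])
print(flat.shape)   # torch.Size([64, 3200])

# 4 outputs is an arbitrary, illustrative choice
fc = nn.Linear(sequence_length * hidden_size, 4)
scores = fc(flat)
print(scores.shape)  # torch.Size([64, 4])
```

Because the linear layer's `in_features` is tied to `sequence_length`, a head built this way only accepts sequences of exactly that length.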


This approach flattens the temporal structure across all timesteps, mixing all the temporal information together before the final classification/regression task. This can be useful if your downstream task benefits from considering the entire sequence at once. However, it's also common to use approaches like adding another LSTM layer, applying pooling over timesteps, or using only the last timestep, depending on the nature of the task and the type of information needed from the sequence data.
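
To make the alternatives mentioned above concrete, here is a minimal sketch comparing the shapes produced by the last time step, mean pooling, max pooling, and flattening. The batch size of 4 is an assumption for illustration; which reduction works best depends on the task.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=64, batch_first=True)
x = torch.randn(4, 50, 8)              # (batch, seq_len, features)
out, _ = lstm(x)                       # (4, 50, 64)

last_step = out[:, -1, :]              # keep only the final time step
mean_pool = out.mean(dim=1)            # average over the time dimension
max_pool = out.max(dim=1).values       # element-wise max over time
flat = out.reshape(out.size(0), -1)    # concatenate all time steps

print(last_step.shape)  # torch.Size([4, 64])
print(mean_pool.shape)  # torch.Size([4, 64])
print(max_pool.shape)   # torch.Size([4, 64])
print(flat.shape)       # torch.Size([4, 3200])
```

The last-step and pooled variants keep the feature size at `hidden_size` regardless of sequence length, while the flattened variant grows with it.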

## Difference between Flatten() and lstm_out[:, -1, :]
Using **lstm_out[:, -1, :]** is a common way to deal with LSTM outputs in sequence prediction tasks where you're only interested in the output of the last time step of the LSTM. An LSTM processes a sequence and produces an output at every time step, but for sequence prediction you often only care about the final output, which summarizes the information of the entire sequence.

The slicing operation **lstm_out[:, -1, :]** selects the output of the last time step for each instance in the batch. It effectively reduces the output tensor from a shape of **[batch_size, sequence_length, features]** to **[batch_size, features]**, where **features** corresponds to the number of hidden units in the LSTM.

This is how you would modify your forward method to use the last output of the LSTM:

```
def forward(self, x):
    lstm_out, _ = self.lstm(x)
    # Use only the output from the last time step
    last_time_step_out = lstm_out[:, -1, :]
    output = self.fc(last_time_step_out)
    return self.softmax(output)
```

Here's what happens in this code:
* The LSTM processes the entire sequence and outputs a tensor of shape **[batch_size, sequence_length, num_hidden_units]** (assuming **batch_first=True**).
* The slicing operation extracts the outputs of the last time step for all instances in the batch. The resulting tensor has a shape of **[batch_size, num_hidden_units]**.
* This tensor is then passed to the fully connected layer (**self.fc**) without the need for flattening since it's already 2D.
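
The slicing step can be checked against the LSTM's returned hidden state. The sketch below assumes a single-layer, unidirectional LSTM with `batch_first=True` and arbitrary sizes; under those assumptions the last time step of the output matches `h_n[-1]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)                 # (batch, seq_len, features)

lstm_out, (h_n, c_n) = lstm(x)
last_time_step_out = lstm_out[:, -1, :]   # (batch, hidden)

print(lstm_out.shape)            # torch.Size([4, 10, 16])
print(last_time_step_out.shape)  # torch.Size([4, 16])
print(h_n.shape)                 # torch.Size([1, 4, 16])

# For one layer and one direction, both views hold the same values
same = torch.allclose(last_time_step_out, h_n[-1])
print(same)  # True
```

For bidirectional or padded inputs this equality no longer holds in general, so the slice and `h_n` should not be treated as interchangeable there.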

When you only use the last time step's output, you're assuming that this final output vector contains all the information you need from the sequence to make the prediction. This is a reasonable approach in many cases, especially when the sequence's temporal order is important and you want to predict an outcome based on the entire sequence up to that point.


However, it's important to note that by taking just the last output, you are discarding all the other outputs from the LSTM. This is not equivalent to flattening. Flattening would retain all information from all time steps and may lead to a model with higher capacity (though not necessarily better performance) since it has more data to work with. By selecting only the last output, you're making a deliberate decision to use less information, which can be beneficial or detrimental (harmful) depending on the problem you're solving.


Q: So, in the Keras code, by using `Flatten()`, this model uses all of the sequence's data. Is that right?

A: In the Keras code, the **Flatten()** layer is indeed used to flatten the entire output of the LSTM layer. This means that the model uses all the data from all the timesteps in the sequence.


When an LSTM layer's output is passed through a **Flatten()** layer, it takes the output tensor of shape **(batch_size, seq_len, num_hidden_units)** and reshapes it into a 2D tensor of shape **(batch_size, seq_len * num_hidden_units)**. This operation retains all the information from every timestep of the sequence, which is then passed on to the subsequent **Dense** layer.


Here's a breakdown of what's happening:
* The LSTM processes each timestep of the input sequence, updating its hidden state at each step.
* The output for each timestep is collected into an output tensor.
* The **Flatten()** layer takes this tensor and reshapes it, concatenating the outputs from all timesteps into a long vector for each sample in the batch.
* The **Dense** layer then operates on this flattened vector, effectively considering information from the entire sequence.



This approach is used when you believe that all timesteps contribute useful information for the downstream task (e.g. making a prediction based on the entire sequence). However, it can significantly increase the number of parameters in the **Dense** layer since the input size becomes **seq_len * num_hidden_units**, which might lead to overfitting if not regularized properly or if insufficient training data is available.
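
To get a feel for the parameter growth described above, the sketch below counts the weights of the final layer in both designs. It uses PyTorch's `nn.Linear` as a stand-in for the Keras **Dense** layer, and the sizes (50 time steps, 64 hidden units, 4 outputs) are illustrative assumptions.

```python
import torch.nn as nn

seq_len, num_hidden_units, num_outputs = 50, 64, 4  # illustrative sizes

# Head fed by the flattened output of all time steps
flatten_head = nn.Linear(seq_len * num_hidden_units, num_outputs)
# Head fed by the last time step only
last_step_head = nn.Linear(num_hidden_units, num_outputs)

def count_params(module):
    """Total number of trainable weights and biases in a module."""
    return sum(p.numel() for p in module.parameters())

flatten_params = count_params(flatten_head)
last_step_params = count_params(last_step_head)
print(flatten_params)    # 12804 = 3200 * 4 + 4
print(last_step_params)  # 260 = 64 * 4 + 4
```

The flattened head is roughly `seq_len` times larger, which is where the extra overfitting risk comes from.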

# How to write PyTorch code without specifying the sequence length for an LSTM

In PyTorch, the LSTM layer can accept sequences of any length without needing to specify the sequence length beforehand. The LSTM layer in PyTorch is flexible and can handle varying sequence lengths within the same batch if packed sequences are used.


Here's how you can define and use an LSTM in PyTorch without specifying a fixed sequence length:

In [None]:
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True) # Batch first means input should be (batch, seq, feature)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # x of shape (batch, seq, feature), seq can vary !!!!!
        lstm_out, _ = self.lstm(x)
        # Take the last output for each sequence
        last_out = lstm_out[:, -1, :]
        output = self.fc(last_out)
        return output


When using this LSTM model, you can feed it batches of data where the sequence length differs from batch to batch. However, within each batch, you need to pad your sequences to the same length or use packed sequences for varying lengths.
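
As a small illustration of this point, the sketch below runs the same LSTM and linear layer on two batches with different sequence lengths. The sizes (8 features, 16 hidden units, 2 outputs, batch of 4) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 2)

shapes = {}
for seq_len in (10, 30):
    x = torch.randn(4, seq_len, 8)     # same model, different seq_len per batch
    lstm_out, _ = lstm(x)
    y = fc(lstm_out[:, -1, :])         # last time step -> fixed-size features
    shapes[seq_len] = (tuple(lstm_out.shape), tuple(y.shape))

print(shapes)
# {10: ((4, 10, 16), (4, 2)), 30: ((4, 30, 16), (4, 2))}
```

The output of the linear layer keeps the same shape because only the last time step is used, so nothing in the model depends on the sequence length.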


To handle sequences of different lengths in a batch more efficiently, you can use **torch.nn.utils.rnn.pack_padded_sequence** and **torch.nn.utils.rnn.pad_packed_sequence**.

Here's an example of how to use them:


In [None]:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class LSTMModelVariableSeq(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModelVariableSeq, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, lengths):
        # x: padded batch of shape (batch, max_seq, feature)
        # lengths: 1D int64 CPU tensor with the true length of each sequence
        # Pack the sequence
        packed_x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
        packed_lstm_out, (h_n, c_n) = self.lstm(packed_x)
        # Unpack the sequence
        lstm_out, _ = pad_packed_sequence(packed_lstm_out, batch_first=True)
        # Take the output at the last valid time step of each sequence (h_n[-1] could also be used)
        batch_idx = torch.arange(lstm_out.size(0), device=lstm_out.device)
        last_out = lstm_out[batch_idx, (lengths - 1).to(lstm_out.device), :]
        output = self.fc(last_out)
        return output


To use **LSTMModelVariableSeq**, you have to provide the length of each sequence when you call the model. This allows the LSTM to know which part of the data is actual sequence data and which part is padding, so it can compute the outputs accurately. The data must be sorted by sequence length in decreasing order if **enforce_sorted** is True, which is the default behavior. If you set **enforce_sorted=False**, PyTorch sorts the sequences for you internally.

This model can handle variable-length sequences in a very efficient way by ignoring the padding during the LSTM computations.
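
The sketch below walks through the pack, run, unpack sequence on a tiny batch and checks the remark that `h_n[-1]` could be used in place of indexing the unpacked output. The sizes and the lengths `[5, 8, 3]` are assumptions for illustration, and the random tensor `x` stands in for an already padded batch.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)

x = torch.randn(3, 8, 3)             # (batch, max_seq, feature); steps past each length count as padding
lengths = torch.tensor([5, 8, 3])    # true lengths, unsorted, 1D int64 on CPU

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
lstm_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Output at the last valid time step of each sequence
batch_idx = torch.arange(lstm_out.size(0))
last_out = lstm_out[batch_idx, lengths - 1, :]

print(lstm_out.shape)     # torch.Size([3, 8, 5])
print(out_lengths)        # tensor([5, 8, 3])
same = torch.allclose(last_out, h_n[-1])
print(same)               # True
```

With packed input, `h_n[-1]` already holds the state at each sequence's last real step (for a unidirectional LSTM), so it is a shorter way to get the same features.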

## What is **`Padding`**?
In the context of LSTM networks, "padding" refers to a common technique used to handle input sequences of varying lengths within the same batch. An LSTM itself can process sequences of any length, but a batch has to be stored as a single fixed-size tensor, so padding fills shorter sequences with dummy values (typically zeros) until all sequences in the batch have the same length.

Here's a detailed explanation of why and how padding is used, especially in relation to LSTM models:

### Why Use Padding?
In many applications involving sequences (like text, time series, etc.), the lengths of the sequences can vary. For instance, sentences in natural language processing have different lengths. However, to process these sequences in batches using deep learning frameworks like PyTorch, all sequences in a batch must be the same length so that they can be stacked into one tensor for matrix operations.


### How Padding Works:
* **Equalizing Sequence Lengths**: Padding involves appending zeros (or another placeholder value) to the end of shorter sequences to match the length of the longest sequence in the batch. This results in a uniform input shape.
* **Example**:
    * Suppose you have three sequences of lengths 5, 8, and 3. To process them in a single batch, you would pad the first sequence to length 8 (adding 3 zeros), and the third sequence to length 8 (adding 5 zeros).
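
The example with lengths 5, 8, and 3 can be reproduced with `torch.nn.utils.rnn.pad_sequence`. In this sketch each sequence is filled with ones and has 2 features per time step; both are assumptions made so that the padded zeros are easy to spot.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

num_features = 2  # illustrative feature count
sequences = [
    torch.ones(5, num_features),
    torch.ones(8, num_features),
    torch.ones(3, num_features),
]
lengths = torch.tensor([len(s) for s in sequences])

# Pad every sequence with zeros up to the longest one in the batch
padded = pad_sequence(sequences, batch_first=True, padding_value=0.0)

print(lengths)          # tensor([5, 8, 3])
print(padded.shape)     # torch.Size([3, 8, 2])
print(padded[0, :, 0])  # tensor([1., 1., 1., 1., 1., 0., 0., 0.])
print(padded[2, :, 0])  # tensor([1., 1., 1., 0., 0., 0., 0., 0.])
```

Keeping `lengths` alongside the padded tensor matters, because the padded tensor alone no longer says where the real data ends.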


### Using Padding Sequences in LSTMs:
* **Ignored by LSTMs**: The padded values (zeros) are meant to be neutral in the computations and should not influence the learning process. LSTMs and other recurrent neural networks can be informed to ignore these padded parts during training to avoid skewing the results. This is done using mechanisms like packed sequences in PyTorch.
* **Packed Sequences**: In PyTorch, **pack_padded_sequence** is used to convert a padded batch of sequences into a packed sequence. This packed format contains information about the original lengths of the sequences, which the LSTM uses to dynamically ignore the padding during the forward and backward passes.
* **Benefits of Packing**:
    * **Efficiency**: The LSTM only processes the actual data points and skips the padding, reducing computation time.
    * **Accuracy**: By skipping the padded areas, the model only learns from the real data, thereby maintaining the accuracy and integrity of the model's learning process.


### Practical Usage:
When implementing an LSTM with padding sequences, you need to:
1. Pad your sequences to the same length.
2. Pass the original lengths of each sequence to **pack_padded_sequence** and feed the resulting packed sequence to the LSTM.
3. The LSTM will then process the sequences, automatically ignoring the padded parts.
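
The three steps above can be sketched as follows. To show that the padding is really ignored, the sketch also runs the short sequence on its own and compares final hidden states. All sizes are illustrative assumptions, and the comparison uses a small tolerance because of floating-point differences between batched and single runs.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=2, hidden_size=4, batch_first=True)

short = torch.randn(3, 2)   # sequence with 3 time steps
long = torch.randn(8, 2)    # sequence with 8 time steps

# Step 1: pad to the same length
padded = pad_sequence([short, long], batch_first=True)   # (2, 8, 2)
lengths = torch.tensor([3, 8])

# Step 2: pack, passing the original lengths
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# Step 3: the LSTM processes the packed batch
_, (h_packed, _) = lstm(packed)

# Reference: the short sequence alone, with no padding at all
_, (h_alone, _) = lstm(short.unsqueeze(0))
# Contrast: the padded batch fed directly, without packing
_, (h_padded, _) = lstm(padded)

matches_alone = torch.allclose(h_packed[-1, 0], h_alone[-1, 0], atol=1e-5)
padding_leaks = not torch.allclose(h_padded[-1, 0], h_alone[-1, 0], atol=1e-5)
print(matches_alone)  # True: packing reproduces the unpadded result
print(padding_leaks)  # True: without packing, the zero steps change the state
```

Without packing, the LSTM keeps updating its state on the zero-valued steps, so the final state of the short sequence no longer reflects only its real data.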


By effectively managing padding and understanding how it interacts with LSTMs, you can handle variable-length sequences in batch processing efficiently, which is crucial in many deep learning applications involving sequence data.