# Dataset class in PyTorch

The `Dataset` class in PyTorch is a fundamental component for managing and using large datasets. It provides a standard interface for datasets that can be indexed and iterated, and it is especially useful when you're dealing with complex data types like time series. When working with time series data, the goal is often to predict future values based on past observations, which requires the data to be sequenced into fixed-length inputs for the model.

## Understanding the `Dataset` Class
The **Dataset** class is an abstract class in PyTorch, meaning you typically subclass it to implement custom behavior for loading and processing your data. The two primary methods you need to override are:


* **__len__(self)**: Returns the size of the dataset.
* **__getitem__(self, index)**: Retrieves the item at the given index.
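
As a minimal sketch of that contract, the following map-style dataset implements both methods, after which `len()` and indexing work on it directly. The class name `SquaresDataset` and its contents are made up purely for illustration.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Called by len(dataset); must return the number of samples.
        return self.n

    def __getitem__(self, index):
        # Called by dataset[index]; returns one (input, label) pair.
        x = torch.tensor(float(index))
        y = torch.tensor(float(index) ** 2)
        return x, y


squares = SquaresDataset(4)
print(len(squares))  # number of samples
print(squares[3])    # the sample at index 3
```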

## Implementing a Time Series **`Dataset`**
Here's a step-by-step guide on how to subclass **Dataset** for a time series dataset:


1. **Define Your Dataset Class**: Extend the **Dataset** class to fit your specific needs, implementing the required **__len__** and **__getitem__** methods.
2. **Prepare Your Data**: For time series data, you usually have sequences of values over time. You'll need to decide how to split these sequences for training, such as using a sliding window approach to create fixed-length subsequences.
3. **Implement __getitem__**: This method should return a single training sample (and possibly a label, for supervised tasks) from your dataset at a specified index. For time series, this might mean <u>returning a sequence of values as your input and the next value in the series as your label</u>.

<br><br>
Let's look at a concrete example of how you might set up a **Dataset** for a simple univariate time series task, where the goal is to predict the next value based on a sequence of previous values:

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    def __init__(self, sequences, sequence_length):
        self.sequences = sequences  # holds the time series data
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.sequences) - self.sequence_length

    def __getitem__(self, index):
        # Fetch a sequence and the next value (label)
        sequence = self.sequences[index:index+self.sequence_length]
        label = self.sequences[index+self.sequence_length]
        # as_tensor avoids the copy-construct warning when the data is already a tensor
        return torch.as_tensor(sequence), torch.as_tensor(label)

# Example usage
sequence_length = 5
# Assuming `data` is your time series data, a 1D tensor (or numpy array) of sequential values
data = torch.randn(100) # Example data

dataset = TimeSeriesDataset(data, sequence_length)
dataloader = DataLoader(dataset, batch_size=10)

for sequence, label in dataloader:
    # Each `sequence` is a batch of sequences, and `label` is the next value
    print(sequence.shape)
    print(label.shape)
    print(sequence)
    print(label)
    break

torch.Size([10, 5])
torch.Size([10])
tensor([[-1.6335,  0.3526, -0.3992,  0.2826, -0.7692],
        [ 0.3526, -0.3992,  0.2826, -0.7692,  0.8167],
        [-0.3992,  0.2826, -0.7692,  0.8167,  0.5386],
        [ 0.2826, -0.7692,  0.8167,  0.5386, -0.9178],
        [-0.7692,  0.8167,  0.5386, -0.9178,  0.7416],
        [ 0.8167,  0.5386, -0.9178,  0.7416, -1.2861],
        [ 0.5386, -0.9178,  0.7416, -1.2861, -0.1797],
        [-0.9178,  0.7416, -1.2861, -0.1797,  1.1192],
        [ 0.7416, -1.2861, -0.1797,  1.1192, -2.5261],
        [-1.2861, -0.1797,  1.1192, -2.5261,  0.6301]])
tensor([ 0.8167,  0.5386, -0.9178,  0.7416, -1.2861, -0.1797,  1.1192, -2.5261,
         0.6301,  1.4682])




In [None]:
data[:20]

tensor([-1.6335,  0.3526, -0.3992,  0.2826, -0.7692,  0.8167,  0.5386, -0.9178,
         0.7416, -1.2861, -0.1797,  1.1192, -2.5261,  0.6301,  1.4682,  0.4238,
        -0.1102, -1.1154, -0.2161,  0.3630])

In this example, `TimeSeriesDataset` takes a time series and a sequence length as input. The **__getitem__** method returns a slice of the time series data of the specified length (**sequence_length**) and the next value in the series as the label. The **DataLoader** can then be used to iterate over the dataset in batches, which is especially useful for training neural networks on the data.


This approach is quite flexible and can be adapted to various types of time series data and tasks, such as multivariate forecasting, classification, and more, by adjusting how sequences and labels are generated and returned by **__getitem__**.
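
As one hedged illustration of such an adaptation, the sketch below turns the same sliding window into a binary classification task: the label is 1 if the next value is higher than the last value in the window, and 0 otherwise. The class name `DirectionDataset` and this labeling rule are assumptions made for the example, not part of the code above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class DirectionDataset(Dataset):
    """Hypothetical variant: classify whether the series goes up next."""

    def __init__(self, series, sequence_length):
        self.series = torch.as_tensor(series, dtype=torch.float32)
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.series) - self.sequence_length

    def __getitem__(self, index):
        end = index + self.sequence_length
        window = self.series[index:end]
        # Class 1 if the next value is higher than the last value in the window.
        label = (self.series[end] > window[-1]).long()
        return window, label


series = torch.tensor([1.0, 2.0, 3.0, 2.5, 2.0, 4.0, 5.0, 1.0])
direction_ds = DirectionDataset(series, sequence_length=3)
loader = DataLoader(direction_ds, batch_size=len(direction_ds))
windows, labels = next(iter(loader))
print(windows.shape)  # torch.Size([5, 3])
print(labels)         # tensor([0, 0, 1, 1, 0])
```

Only `__getitem__` changes relative to the regression version; `__len__` and the `DataLoader` usage stay the same.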

## \_\_len__ function


**The `__len__` function in PyTorch Datasets**

In PyTorch, the `__len__` method plays a crucial role within custom datasets you create using `torch.utils.data.Dataset`. It is one of Python's special methods: the built-in `len()` function calls it, so `len(my_dataset)` returns whatever `__len__` returns. It serves the following key purpose:

**1. Determines Dataset Length:**

- When you use a custom dataset with a DataLoader object (e.g., `data_loader = torch.utils.data.DataLoader(my_dataset, batch_size=32)`), PyTorch calls the `__len__` method on your dataset instance.
- The `__len__` method's responsibility is to return an integer value that represents the total number of samples (data points) in your dataset.
- This information is essential for the DataLoader to effectively iterate through your dataset during training or evaluation. It allows the DataLoader to:
    - Split the dataset into batches of the specified size (provided by the `batch_size` argument).
    - Determine the number of iterations required to process the entire dataset.

**Example:**

In [None]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)  # Assuming 'data' is a list or other iterable

# Usage
my_dataset = MyDataset(...)
data_loader = torch.utils.data.DataLoader(my_dataset, batch_size=64)

In this example, the `__len__` method simply returns the length of the `self.data` list, which is assumed to hold your dataset samples.
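
To make the link between `__len__` and batching concrete, here is a small sketch (the `RangeDataset` class is hypothetical). `len(loader)` is the number of batches: with the default `drop_last=False` it is the dataset length divided by the batch size, rounded up.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RangeDataset(Dataset):
    """Hypothetical dataset holding the integers 0..n-1."""

    def __init__(self, n):
        self.data = torch.arange(n)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


range_ds = RangeDataset(100)
loader = DataLoader(range_ds, batch_size=32)
loader_drop = DataLoader(range_ds, batch_size=32, drop_last=True)

print(len(range_ds))     # 100 samples
print(len(loader))       # 4 batches: ceil(100 / 32)
print(len(loader_drop))  # 3 batches: the partial last batch is dropped
batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)       # [32, 32, 32, 4]
```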

**Key Points to Remember:**

- The `__len__` method should be a straightforward implementation that reflects the actual size of your dataset.
- For datasets with a finite number of samples, it's typically implemented by returning the length of an underlying list, array, or other iterable containing your data.
- If your dataset represents an unbounded or dynamically generated data stream, a map-style `Dataset` with `__len__` is usually a poor fit. PyTorch provides `torch.utils.data.IterableDataset` for that case, which is iterated rather than indexed and does not require `__len__`. Returning a pre-defined large value (e.g., `sys.maxsize`) is possible but fragile, because samplers treat that value as a real length, so exercise caution in such scenarios.
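
A minimal sketch of the streaming case, assuming an endless synthetic source (the `NoiseStream` class and its mean-valued target are invented for illustration). It relies on `torch.utils.data.IterableDataset`, which defines `__iter__` instead of `__getitem__` and `__len__`, so the consumer decides when to stop:

```python
import itertools

import torch
from torch.utils.data import DataLoader, IterableDataset


class NoiseStream(IterableDataset):
    """Hypothetical endless stream of random windows with a stand-in target."""

    def __init__(self, sequence_length):
        self.sequence_length = sequence_length

    def __iter__(self):
        while True:
            window = torch.randn(self.sequence_length)
            yield window, window.mean()


stream_loader = DataLoader(NoiseStream(sequence_length=5), batch_size=4)

# The stream never ends, so take a fixed number of batches explicitly.
stream_batches = list(itertools.islice(stream_loader, 3))
for x, y in stream_batches:
    print(x.shape, y.shape)  # torch.Size([4, 5]) torch.Size([4])
```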

**In essence, the `__len__` function acts as a contract between your custom dataset and PyTorch's DataLoader, enabling efficient data iteration and batching during training and evaluation.**

### \_\_len__ function to DataLoader
When creating a PyTorch Dataset class for LSTM sequence data, the `__len__` function needs to consider the way you're structuring your sequences. Here's how you can set it up along with an example:

**Understanding Sequence Structure:**

There are two common approaches to representing sequence data for LSTMs in PyTorch datasets:

1. **Single Sequence per Sample:**
   - Each sample in your dataset is a single, complete sequence.
   - The `__len__` function simply returns the total number of sequences in your dataset.

2. **Sequences as Subsets of a Larger Data Source:**
   - You might have a larger data source (e.g., entire file, list of all data points), and your dataset extracts subsequences of a fixed or variable length for training.
   - The `__len__` function needs to account for how many subsequences you can create from the data source.

**Example Code (Single Sequence per Sample):**

In [None]:
class LSTMSequenceDataset(torch.utils.data.Dataset):
    def __init__(self, sequences, labels):
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        label = self.labels[idx]
        # Convert sequence and label to tensors (if needed)
        return sequence, label

- This example assumes you have two lists: `sequences` (containing individual sequences) and `labels` (corresponding labels for each sequence).
- The `__len__` function returns the length of the `sequences` list, representing the number of samples in the dataset.

**Example Code (Sequences as Subsets):**

In [None]:
class LSTMSequenceSubsetDataset(torch.utils.data.Dataset):
    def __init__(self, data, sequence_length):
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        # Assuming 'data' is a long list (larger than sequence_length)
        return (len(self.data) - self.sequence_length) + 1  # Account for all possible subsequences

    def __getitem__(self, idx):
        sequence = self.data[idx:idx + self.sequence_length]
        # Convert sequence to tensor (if needed)
        return sequence

- This example assumes a larger data source (`data`) from which you extract subsequences of a fixed length (`sequence_length`).
- The `__len__` function calculates the maximum number of possible subsequences by subtracting the sequence length from the total data length and adding 1 (to account for the first possible subsequence starting at index 0).
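
The two length formulas used in this notebook (`len(data) - sequence_length` when each window also needs the next value as a label, and `len(data) - sequence_length + 1` when it does not) can be checked by brute force. The helper `count_windows` below is a hypothetical name written only for this check.

```python
def count_windows(n, sequence_length, needs_next_value):
    """Count the valid start indices by brute force."""
    count = 0
    for start in range(n):
        end = start + sequence_length  # exclusive end of the input window
        # Highest index that gets read: the label at `end`, or the last window element.
        last_needed = end if needs_next_value else end - 1
        if last_needed < n:
            count += 1
    return count


n, sequence_length = 10, 3
windows_only = count_windows(n, sequence_length, needs_next_value=False)
windows_with_label = count_windows(n, sequence_length, needs_next_value=True)
print(windows_only)        # 8 == n - sequence_length + 1
print(windows_with_label)  # 7 == n - sequence_length
```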

**Remember to adapt these examples to your specific data structure and sequence processing logic.**

By correctly implementing the `__len__` function, you ensure that your PyTorch DataLoader can efficiently iterate through your LSTM sequence data during training and evaluation.


# Batch processing

In deep learning, the term "batch processing" refers to the technique of training a neural network using batches of input data at a time, rather than feeding the entire dataset or single data points into the model at once. This approach offers a balance between the computational efficiency of processing many data points at once and the memory constraints that prevent loading the entire dataset into memory.

### Key Points About Batch Processing:

- **Batch Size**: This is the number of samples of data processed before the model's internal parameters (weights) are updated. It's a crucial hyperparameter that can affect the performance and efficiency of the learning process. Common batch sizes include 32, 64, 128, etc., though the optimal size can vary depending on the specific task and hardware constraints.

- **Types of Batches**:
  - **Mini-Batch**: The most common approach, where the dataset is divided into small batches. It combines the advantages of both batch and stochastic gradient descent methods.
  - **Full Batch**: The entire dataset is processed at once. This is rarely used in practice due to memory limitations and less efficient training dynamics.
  - **Stochastic**: A special case where the batch size is 1, meaning the network is updated after every single sample. It's highly efficient in terms of memory but can lead to a lot of noise in the training process.

- **Advantages**:
  - **Efficiency**: Batch processing is more computationally efficient than stochastic methods, as it can leverage vectorized operations and parallel processing.
  - **Generalization**: By averaging the gradient over a batch, it can smooth out some of the noise in the training data, potentially leading to better generalization.
  - **Memory Management**: It allows for training on datasets that are too large to fit into memory all at once.

- **Disadvantages**:
  - **Memory Requirement**: Larger batch sizes require more memory, which can be a limiting factor on some hardware.
  - **Hyperparameter Tuning**: Finding the optimal batch size can be a process of trial and error, as it depends on the specific dataset and model architecture.

In practice, selecting the right batch size is a balance between training speed, memory limitations, and the stability of the convergence process. It's often determined empirically, as part of the model tuning process.
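
As a rough sketch of how batch size translates into parameter updates, the snippet below counts the updates per epoch for stochastic, mini-batch, and full-batch settings, then runs one epoch of mini-batch training. The synthetic data, the linear model, and the learning rate are all placeholder assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
num_samples = 1000
toy_ds = TensorDataset(torch.randn(num_samples, 10), torch.randn(num_samples))

# batch_size=1 is stochastic, 32 is mini-batch, num_samples is full batch.
updates_per_epoch = {
    batch_size: len(DataLoader(toy_ds, batch_size=batch_size))
    for batch_size in (1, 32, num_samples)
}
print(updates_per_epoch)  # {1: 1000, 32: 32, 1000: 1}

# One epoch of mini-batch training: one optimizer step per batch.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
num_updates = 0
for xb, yb in DataLoader(toy_ds, batch_size=32, shuffle=True):
    optimizer.zero_grad()
    loss = loss_fn(model(xb).squeeze(-1), yb)  # [batch, 1] -> [batch]
    loss.backward()
    optimizer.step()
    num_updates += 1
print(num_updates)  # 32
```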

# DataLoader class

In PyTorch, the `DataLoader` class is a flexible and efficient way of iterating over a dataset. When the underlying dataset returns `(input, label)` pairs, each batch fetched from the `DataLoader` contains two elements: the batched input data and the batched labels. Both are organized into tensors.

Here's a breakdown of the output structure you can expect from iterating over a `DataLoader` instance:

- **Batch of Input Data**: This is typically a tensor (or a collection of tensors if your dataset returns multiple inputs) that contains a batch of input samples. The shape of this tensor usually follows the pattern `(batch_size, feature_dimensions...)`, where `batch_size` is the number of samples in the batch, and `feature_dimensions...` represents the dimensions of the input features. For example, in the case of images, this might be `(batch_size, channels, height, width)`.

- **Batch of Labels**: This is a tensor containing the labels corresponding to each input sample in the batch. The shape of the labels tensor often depends on the type of problem you're working on. For classification tasks, it might be a 1D tensor of size `batch_size`, where each entry is the label index for the corresponding input sample. For regression tasks, it could be a tensor of shape `(batch_size, target_dimensions)` if you're predicting multiple values per sample.

Here's a simple example to illustrate how you might use a `DataLoader` in PyTorch:

```python
from torch.utils.data import DataLoader, TensorDataset
import torch

# Dummy dataset with 100 samples, each sample is a 10-dimensional vector
inputs = torch.randn(100, 10)
# Dummy labels, one label per sample
labels = torch.randint(0, 2, (100,))

# Create a TensorDataset and DataLoader
dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset, batch_size=20)

# Iterate over the DataLoader
for batch_idx, (data, target) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f" - Data shape: {data.shape}")  # Should be torch.Size([20, 10])
    print(f" - Target shape: {target.shape}")  # Should be torch.Size([20])
```

In this example, each iteration over the `DataLoader` yields a batch where `data` is a tensor of input samples with shape `(20, 10)` (since we specified `batch_size=20`), and `target` is a tensor of labels with shape `(20,)`.

It's also worth noting that `DataLoader` can handle more complex data structures through custom `Dataset` classes, allowing for much flexibility in terms of what each batch can contain (e.g., images, text, additional metadata, etc.).
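
For instance, a dataset may return a dictionary per sample. The sketch below assumes the default collate behavior, which batches each dictionary key separately and leaves strings as a plain list; the `DictSampleDataset` class and its field names are hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class DictSampleDataset(Dataset):
    """Hypothetical dataset whose samples are dictionaries."""

    def __init__(self, n):
        self.features = torch.randn(n, 3)
        self.labels = torch.arange(n) % 2

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return {
            "features": self.features[index],
            "label": self.labels[index],
            "name": f"sample_{index}",  # non-tensor metadata
        }


dict_loader = DataLoader(DictSampleDataset(10), batch_size=4)
first_batch = next(iter(dict_loader))
print(first_batch["features"].shape)  # torch.Size([4, 3])
print(first_batch["label"])           # tensor([0, 1, 0, 1])
print(first_batch["name"])            # list of 4 strings
```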

# How to create sequence data in PyTorch

The **DataLoader** in PyTorch does not automatically create sequences from your data. It is designed to efficiently load and iterate over datasets, handling batching, sampling, shuffling, and multiprocessing. The responsibility for defining how the data should be structured into sequences falls on the dataset being passed to the **DataLoader**.


To work with sequential data, like time series or text, you typically need to preprocess your data into the desired sequence format before feeding it into a **DataLoader**. <u>This involves creating a custom **Dataset** class that takes your raw data and transforms it into sequences of the desired length</u>.


Here's a simple outline of how you might do this:

1. **Define a Custom Dataset**: Subclass **torch.utils.data.Dataset** to create a dataset that returns data in the sequence format you need. This involves implementing the **__init__**, **__len__**, and **__getitem__** methods to handle your data's loading, length reporting, and item access, respectively.
2. **Preprocess Data into Sequences**: In the **__getitem__** method, <u>you can define logic to convert your data into sequences. For example, if you're working with time series data, you could create sequences of a specific length based on the time steps</u>.
3. **Use DataLoader**: Once you have a **Dataset** that outputs data in the correct sequence format, you can pass it to a **DataLoader** to handle batching and further processing like shuffling or parallel loading.


Here's a basic example of what this might look like:

In [None]:
### DataLoader for predicting the next-day values of 5 stocks from the past week of sequence data for those 5 stocks

from torch.utils.data import Dataset, DataLoader
import torch

class sequenceDataset(Dataset):
    def __init__(self, data, sequence_length):
        # The data is simply handed to the class here, so there is no need to split it into x and y
        # Splitting it here for readability would also be fine
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, index):
        # We predict the next values of all 5 features, so the returned target is the next row of 5 feature values
        # The sequence data is built and returned on each call
        return (self.data[index:index+self.sequence_length],
                self.data[index+self.sequence_length])

# Example usage
data = torch.randn(100, 5)  # Example data: 100 samples, each with 5 features
sequence_length = 7
dataset = sequenceDataset(data, sequence_length)
dataloader = DataLoader(dataset, batch_size=20)

for batch_idx, (data, target) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f" - Data shape: {data.shape}")  # Expected to be torch.Size([20, 7, 5])
    print(f" - Target shape: {target.shape}")  # Expected to be torch.Size([20, 5])

Batch 0:
 - Data shape: torch.Size([20, 7, 5])
 - Target shape: torch.Size([20, 5])
Batch 1:
 - Data shape: torch.Size([20, 7, 5])
 - Target shape: torch.Size([20, 5])
Batch 2:
 - Data shape: torch.Size([20, 7, 5])
 - Target shape: torch.Size([20, 5])
Batch 3:
 - Data shape: torch.Size([20, 7, 5])
 - Target shape: torch.Size([20, 5])
Batch 4:
 - Data shape: torch.Size([13, 7, 5])
 - Target shape: torch.Size([13, 5])


In [None]:
target
# (batch_size, 5_stock_prices)
# In this case we predict the prices of 5 features (stocks): multivariate forecasting

tensor([[ 0.2391,  1.8495, -0.1951,  2.0985,  1.3999],
        [ 1.2587, -0.5700, -0.6957,  0.9828, -0.3854],
        [-1.1010,  0.3142,  0.3250,  2.0819,  2.0572],
        [-1.5021, -0.0376,  1.2581, -0.0857, -0.6342],
        [ 0.6008, -0.6819,  0.7142, -1.5600,  0.1330],
        [-0.4801, -1.3448,  0.2064,  0.2305,  1.3192],
        [-0.7040,  0.1032, -0.6216, -1.4984, -0.4733],
        [ 0.7143,  0.9961, -1.5333,  0.1347,  1.6404],
        [ 1.1530,  0.7484,  0.2664,  0.9285,  1.4625],
        [-0.4010,  0.5652, -1.3499,  1.0218, -1.3819],
        [-1.6391,  1.1850, -0.4324, -1.1582, -1.4239],
        [ 2.1294,  2.1211,  0.2142, -0.5402, -0.0328],
        [-0.1397,  2.2330, -0.3432,  0.9062,  1.4824]])

In this example, `sequenceDataset` is designed to take a dataset of continuous data and generate sequences of a specified length (**sequence_length**). The **DataLoader** then handles these sequences, providing batches of them for training or inference.
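
A quick way to confirm that inputs and targets are aligned is to use data where each row is easy to identify. The sketch below repeats the same windowing logic under a hypothetical class name (`WindowDataset`) so that it is self-contained, and fills row `t` with the value `t`.

```python
import torch
from torch.utils.data import Dataset


class WindowDataset(Dataset):
    """Same windowing logic as above, repeated so the snippet runs on its own."""

    def __init__(self, data, sequence_length):
        self.data = data
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, index):
        end = index + self.sequence_length
        return self.data[index:end], self.data[end]


# Row t holds the value t in all 5 columns, so rows are easy to recognize.
raw = torch.arange(20, dtype=torch.float32).unsqueeze(1).repeat(1, 5)
window_ds = WindowDataset(raw, sequence_length=7)

x0, y0 = window_ds[0]
x_last, y_last = window_ds[len(window_ds) - 1]
print(len(window_ds))           # 13
print(x0[:, 0], y0[0])          # rows 0..6 as input, row 7 as target
print(x_last[:, 0], y_last[0])  # rows 12..18 as input, row 19 as target
```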

## conclusion of sequencing

When modeling with sequence data in PyTorch, especially when using the `Dataset` and `DataLoader` classes for handling your data, you generally need to implement the logic for generating sequences within the `__getitem__()` method of your custom `Dataset` class.

The `__getitem__()` method is called by the `DataLoader` for each index requested. By defining how data is transformed into sequences within `__getitem__()`, you ensure that each call retrieves a correctly formatted sequence along with its associated label (if applicable). This setup allows you to dynamically convert your dataset into sequences on-the-fly during the training or inference process, making it a powerful approach for working with sequential data such as time series, sentences, or any ordered sequence of data points.

Here's a brief overview of the steps involved:

1. **Prepare Your Raw Data**: This could be anything from a series of measurements over time, text data, or any sequential data where the order of data points matters.

2. **Define Your Custom Dataset Class**: Subclass `torch.utils.data.Dataset` and implement the `__init__()`, `__len__()`, and `__getitem__()` methods. In `__getitem__()`, you include your logic to transform a portion of your raw data into a sequence.

3. **Generate Sequences in `__getitem__()`**: Within this method, slice your raw data into sequences of the desired length based on the index parameter. This is where you decide how to handle the start and end of your dataset, how to pair inputs with labels (if doing supervised learning), and any other preprocessing steps like normalization.

4. **Use DataLoader to Fetch Batches**: Pass your custom dataset to a `DataLoader` instance to easily fetch batches of sequences for training or evaluation. The `DataLoader` will handle the details of batching, shuffling (if desired), and parallel data loading.

By customizing the `__getitem__()` method, you have full control over how your data is presented to the model, allowing for a wide range of sequence-based tasks to be tackled efficiently.
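
As a sketch of step 3 with one preprocessing step included, the dataset below normalizes each window inside `__getitem__()`. The class name `NormalizedWindowDataset`, and the choice to take the mean and standard deviation from the first part of the series only, are assumptions made for illustration.

```python
import torch
from torch.utils.data import Dataset


class NormalizedWindowDataset(Dataset):
    """Hypothetical dataset that standardizes windows on the fly."""

    def __init__(self, data, sequence_length, mean, std):
        self.data = torch.as_tensor(data, dtype=torch.float32)
        self.sequence_length = sequence_length
        self.mean = mean
        self.std = std

    def __len__(self):
        return len(self.data) - self.sequence_length

    def __getitem__(self, index):
        end = index + self.sequence_length
        # Normalize input window and label with the same statistics.
        x = (self.data[index:end] - self.mean) / self.std
        y = (self.data[end] - self.mean) / self.std
        return x, y


raw_series = torch.tensor([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
train_part = raw_series[:4]  # statistics come from the training portion only
norm_ds = NormalizedWindowDataset(raw_series, 3, train_part.mean(), train_part.std())

x, y = norm_ds[0]
print(len(norm_ds))
print(x, y)
```

Computing the statistics outside the dataset keeps information from later time steps out of the normalization of earlier ones.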

## Example 2: Dataset and DataLoader for predicting one stock price from 4 features (week, CPI, GDP, FX)

In [None]:
## create dummy data
import pandas as pd
import numpy as np

# Assuming each feature is randomly generated for demonstration
np.random.seed(42)  # For reproducibility
data = pd.DataFrame({
    'Amazon stock price': np.random.rand(100) * 100,  # Dummy stock prices
    'week': np.arange(100) % 52,  # Weeks in a year, repeating
    'CPI': np.random.rand(100),  # Dummy CPI values
    'GDP': np.random.rand(100),  # Dummy GDP values
    'FX': np.random.rand(100)  # Dummy FX rates
})
data.head(3)

|   | Amazon stock price | week | CPI | GDP | FX |
|---|---|---|---|---|---|
| 0 | 37.454012 | 0 | 0.031429 | 0.642032 | 0.051682 |
| 1 | 95.071431 | 1 | 0.63641 | 0.08414 | 0.531355 |
| 2 | 73.199394 | 2 | 0.314356 | 0.161629 | 0.540635 |


In [None]:
## Implement the Custom Dataset Class
import torch
from torch.utils.data import Dataset, DataLoader

class StockPriceDataset(Dataset):
    def __init__(self, dataframe, sequence_length=7):
        self.dataframe = dataframe
        self.sequence_length = sequence_length

        # Normalize your features as needed
        # Here, we're skipping normalization for simplicity

    def __len__(self):
        # Subtracting sequence_length to avoid out-of-bounds
        return len(self.dataframe) - self.sequence_length

    def __getitem__(self, index):
        # Extract the sequence of features
        features = self.dataframe[['week', 'CPI', 'GDP', 'FX']].iloc[index:index+self.sequence_length].values
        # Target: Amazon stock price of the next day
        target = self.dataframe['Amazon stock price'].iloc[index+self.sequence_length]

        return torch.tensor(features, dtype=torch.float32), torch.tensor(target, dtype=torch.float32)

In [None]:
## Create the dataset
sequence_length = 7 # Previous 7 days
dataset = StockPriceDataset(data, sequence_length)

# Create the DataLoader
batch_size = 20
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) # Shuffle for training

# Example of iterating over the DataLoader
for features, target in dataloader:
    print(f"Features shape: {features.shape}")
    print(f"Target shape: {target.shape}")
    print()

Features shape: torch.Size([20, 7, 4])
Target shape: torch.Size([20])

Features shape: torch.Size([20, 7, 4])
Target shape: torch.Size([20])

Features shape: torch.Size([20, 7, 4])
Target shape: torch.Size([20])

Features shape: torch.Size([20, 7, 4])
Target shape: torch.Size([20])

Features shape: torch.Size([13, 7, 4])
Target shape: torch.Size([13])



In this setup, **features** for each batch will have a shape of **[batch_size, sequence_length, num_features]** (num_features=4 in this case, as 'week', 'CPI', 'GDP', 'FX' are used for prediction), and **target** will have a shape of **[batch_size]**, representing the Amazon stock price you're trying to predict for each sequence.


This framework gives you a starting point for developing a model to forecast stock prices based on the past 7 days of data for the given features. Remember to adjust the normalization and data preprocessing steps according to your specific needs.

# Creating batch shapes with DataLoader

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)  # For reproducibility

# Generate a DataFrame with datetime information
num_hours = 365 * 24  # A year's worth of hourly data
date_rng = pd.date_range(start='2020-01-01', end='2020-12-31', freq='H')  # ISO dates avoid day/month ambiguity
df = pd.DataFrame(date_rng, columns=['date'])
df['weekday'] = df['date'].dt.weekday
df['hour'] = df['date'].dt.hour
df['season'] = df['date'].dt.month % 12 // 3 + 1

# Generate synthetic features and target variable
for i in range(7):  # Additional 7 features
    df[f'feature_{i}'] = np.random.rand(len(df))
df['electricity_consumption'] = np.random.rand(len(df)) * 100  # Target variable

# Placeholder split logic (actual logic may vary based on time series considerations)
train_df = df[:int(0.8*len(df))]
test_df = df[int(0.8*len(df)):]

train_df.head(3)

|   | date | weekday | hour | season | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | electricity_consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 00:00:00 | 2 | 0 | 1 | 0.37454 | 0.671368 | 0.40998 | 0.421576 | 0.137686 | 0.120749 | 0.616654 | 1.923384 |
| 1 | 2020-01-01 01:00:00 | 2 | 1 | 1 | 0.950714 | 0.523158 | 0.838483 | 0.280547 | 0.260339 | 0.520433 | 0.003229 | 47.550482 |
| 2 | 2020-01-01 02:00:00 | 2 | 2 | 1 | 0.731994 | 0.898639 | 0.185176 | 0.895044 | 0.48954 | 0.095159 | 0.792586 | 26.352564 |


In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler

class TimeSeriesDataset(Dataset):
    def __init__(self, dataframe, input_steps, forecast_steps, scaler):
        """
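        Initialize the dataset with a pre-fitted scaler.
        input_steps: encoder length (number of past time steps fed to the model)
        forecast_steps: forecast length (number of future time steps to predict)
        scaler: an already-fitted scaler, e.g. a MinMaxScaler after calling fit()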
        """
        self.input_steps = input_steps
        self.forecast_steps = forecast_steps
        self.scaler = scaler

        # Separate features and target
        features = dataframe.drop(columns=['electricity_consumption'])
        target = dataframe[['electricity_consumption']]

        # Transform features using the already fitted scaler
        self.features = self.scaler.transform(features)
        self.target = target.values  # convert to a NumPy array; 2D with shape (num_records, 1)

    def __len__(self):
        return len(self.features) - self.input_steps - self.forecast_steps

    def __getitem__(self, idx):
        X = self.features[idx:idx+self.input_steps]
        y = self.target[idx+self.input_steps:idx+self.input_steps+self.forecast_steps].flatten()
        # flatten() makes y 1D here; batching these 1D samples into a 2D tensor is done by the DataLoader
        return torch.tensor(X, dtype=torch.float), torch.tensor(y, dtype=torch.float)


### `flatten()` in `__getitem__`
The use of **flatten()** in the **__getitem__** method of your **TimeSeriesDataset** class serves an important purpose: it ensures that the labels (targets for prediction) are in the correct shape for comparison against the model's predictions during the loss calculation phase of training.

<br>

**Understanding the Shapes**

* **Model's Prediction Shape**: In the modified **ElecticityConsumptionModel**, the output predictions have a shape of **[batch_size, forecast_length]**. For instance, if you're predicting electricity consumption for the next 24 hours **(forecast_length = 24)** for a batch of 20 samples **(batch_size = 20)**, the output predictions will have a shape of **[20, 24]**.
* **Target Labels Shape**: Ideally, the target labels should match this shape exactly for proper loss computation. However, when slicing arrays or tensors, there's a risk of introducing or retaining an unnecessary extra dimension, resulting in a shape like **[20, 24, 1]** instead of **[20, 24]**.

<br>

**The role of flatten()**

* **Flattening Labels**: Applying **flatten()** to each sample's target turns its shape from **(24, 1)** into **(24,)**, so the batched labels come out as **[20, 24]** instead of **[20, 24, 1]**. This makes the labels directly comparable to the model's output without dimension mismatch issues.
* **Why It's Necessary**: During the training phase, specifically in the loss calculation step, PyTorch expects the predictions and labels to have compatible shapes. A mismatch, such as an extra dimension in the labels, can lead to errors or incorrect loss calculations. Using **flatten()** (or similarly **squeeze()**) standardizes the shapes, facilitating correct and efficient training.

<br>

**Example**

Suppose your labels tensor would have a shape of **[20, 24, 1]** because of how the data was sliced or prepared. This shape means each of the 24 forecast hours carries an extra trailing dimension of size **1**, which is unnecessary for comparison with the model's output. Flattening each sample's target yields **[20, 24]** after batching, aligning it with the prediction shape and allowing for correct loss computation.
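
The shapes discussed above can be reproduced with stand-in arrays. In this sketch the random target values and the all-zero predictions are placeholders, not output from the actual model.

```python
import numpy as np
import torch
import torch.nn as nn

forecast_steps, batch_size = 24, 20

# One sample's target slice, as it comes out of an (n, 1) array.
target_slice = np.random.rand(forecast_steps, 1)
print(target_slice.shape)            # (24, 1)
print(target_slice.flatten().shape)  # (24,)

# Stack 20 samples, without and with flattening each one first.
unflattened = torch.stack(
    [torch.tensor(target_slice, dtype=torch.float) for _ in range(batch_size)]
)
flattened = torch.stack(
    [torch.tensor(target_slice.flatten(), dtype=torch.float) for _ in range(batch_size)]
)
print(unflattened.shape)  # torch.Size([20, 24, 1])
print(flattened.shape)    # torch.Size([20, 24])

# Stand-in predictions with the model's output shape [batch_size, forecast_steps].
predictions = torch.zeros(batch_size, forecast_steps)
loss = nn.MSELoss()(predictions, flattened)
print(loss.shape)  # torch.Size([]): a scalar loss

# squeeze(-1) removes the same trailing dimension after batching.
print(unflattened.squeeze(-1).shape)  # torch.Size([20, 24])
```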

In [None]:
encoder_length = 168  # 7 days of hourly records (= sequence length)
forecast_length = 24  # Predicting the next 24 hours

# Fit scaler on training features
scaler = MinMaxScaler()

# Drop the 'date' column along with 'electricity_consumption' to prepare features for scaling
features_train = train_df.drop(columns=['date', 'electricity_consumption'])
scaler.fit(features_train)

# When initializing your datasets, ensure the 'date' column is also excluded from the features
train_dataset = TimeSeriesDataset(train_df.drop(columns=['date']), encoder_length, forecast_length, scaler)
test_dataset = TimeSeriesDataset(test_df.drop(columns=['date']), encoder_length, forecast_length, scaler)

# set DataLoader
train_loader = DataLoader(train_dataset, batch_size=20, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=20, shuffle=False)

for X, y in train_loader:
    print(X.shape)
    print(y.shape)
    break

# Each sample uses 168 hourly records with 10 features to predict the next 24 hours.
# 20 is the batch size.

torch.Size([20, 168, 10])
torch.Size([20, 24])


### Data Structure of `self.target` in `__getitem__`
The data structure of **self.target** in the **__getitem__** method is initially determined by how it's set in the **__init__** method. Since **self.target** is assigned as **target.values**, where **target** is a DataFrame containing only the **electricity_consumption** column, **self.target** will be a 2D numpy array with shape **(n, 1)**, where **n** is the number of rows in the DataFrame. This shape corresponds to the total number of data points in your dataset for the target variable.


When you access **self.target** within **__getitem__**, for each item, you're slicing this array to get a portion of it based on **idx**, **input_steps**, and **forecast_steps**. This slicing operation for **y**:

```
y = self.target[idx+self.input_steps:idx+self.input_steps+self.forecast_steps].flatten()
```

This line takes a slice of **self.target** corresponding to the forecast period and then flattens it. The flattening operation changes its **<u>shape from a 2D array to a 1D array</u>**. Therefore, after flattening, if **forecast_steps** were 24, for example, **y** would have a shape of **(24, )**. The flattening is done because your target variable (**y**) for each sample is expected to be a 1D tensor representing the series of electricity consumption values you're trying to predict for the forecast period.


To summarize: before flattening, each slice of **self.target** that corresponds to a single **y** in **__getitem__** has a shape like **(forecast_steps, 1)**; after flattening, its shape is **(forecast_steps, )**.


In [None]:
df[['electricity_consumption']].values.shape

(8761, 1)

In [None]:
df[['electricity_consumption']].values.flatten().shape

(8761,)
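The same shape change can be checked on a single forecast slice rather than on the whole column. The following is a minimal sketch with a synthetic array standing in for **self.target**; the values of `n`, `idx`, `input_steps`, and `forecast_steps` are illustrative assumptions, not taken from the dataset above.

```python
import numpy as np

# Hypothetical stand-in for self.target: a single-column array of shape (n, 1)
n = 300
target = np.arange(n, dtype=np.float32).reshape(-1, 1)

# Illustrative values; the real ones come from the dataset's constructor
idx, input_steps, forecast_steps = 5, 168, 24

# Slice the forecast period that follows the input window
window = target[idx + input_steps : idx + input_steps + forecast_steps]
y = window.flatten()

print(window.shape)  # (24, 1)
print(y.shape)       # (24,)
```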

### Where is batch size data created?
The batch dimension is not explicitly created within the **__getitem__()** method of a PyTorch **Dataset** class. Instead, the batching logic is handled by the **DataLoader**, which wraps around the **Dataset**.

<br>

**__getitem__() method**:

* The **__getitem__()** method is responsible for retrieving a <u>single item</u> from the dataset. When you implement a custom dataset by subclassing PyTorch's **Dataset**, you define how a single sample of data is processed and returned by this method. In your case, for each index **idx**, **__getitem__()** returns a single sample (and its corresponding label or target) where both the input (**x**) and the target (**y**) are shaped according to the individual sample's requirements. For the target, this means a 1D tensor with length equal to **forecast_steps**, as per your setup.

<br>

**DataLoader and Batching**:
* The **DataLoader** takes your **Dataset** instance and allows for easy iteration over the dataset in mini-batches. When you use a **DataLoader** with your dataset, it automatically gathers samples into batches. It does this by calling the **__getitem__()** method of your dataset multiple times to fetch individual samples and then stacking these samples together to form a batch.
* By default, <u>the **DataLoader** adds an extra dimension (the batch dimension) as the first dimension of the tensors</u> it creates. This means if your **__getitem__()** method returns a target tensor **y** with shape **(forecast_steps, )** for a single sample, and you set your **DataLoader**'s **batch_size** to **N**, the DataLoader will combine these individual samples into a batch where the shape of **y** in each batch will be **(N, forecast_steps)**. This is because it stacks **N** such 1D tensors along a new dimension, resulting in a 2D tensor.
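To make the stacking step concrete, here is a minimal sketch that compares a batch produced by the **DataLoader** with one assembled by hand using `torch.stack`. The `ToyWindows` class and its sizes are hypothetical and exist only for this illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyWindows(Dataset):
    """Hypothetical dataset: every sample is an (x, y) pair with fixed shapes."""

    def __init__(self, n_samples=8, input_steps=4, n_features=3, forecast_steps=2):
        self.x = torch.arange(
            n_samples * input_steps * n_features, dtype=torch.float32
        ).reshape(n_samples, input_steps, n_features)
        self.y = torch.arange(
            n_samples * forecast_steps, dtype=torch.float32
        ).reshape(n_samples, forecast_steps)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        # One sample only: x is (input_steps, n_features), y is (forecast_steps,)
        return self.x[idx], self.y[idx]


dataset = ToyWindows()
loader = DataLoader(dataset, batch_size=4, shuffle=False)
batch_x, batch_y = next(iter(loader))

# Rebuild the first batch by hand: fetch four samples, then stack on a new dim 0
samples = [dataset[i] for i in range(4)]
manual_x = torch.stack([s[0] for s in samples])
manual_y = torch.stack([s[1] for s in samples])

print(batch_x.shape, batch_y.shape)  # torch.Size([4, 4, 3]) torch.Size([4, 2])
print(torch.equal(batch_x, manual_x), torch.equal(batch_y, manual_y))  # True True
```

With `shuffle=False`, the first batch consists of samples 0 to 3, which is why the manual stack matches it exactly.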


For example, if you create a DataLoader with your **TimeSeriesDataset** like this:

```
dataset = TimeSeriesDataset(dataframe=df, input_steps=12, forecast_steps=24, scaler=scaler)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```

For each iteration over the **dataloader**, it will yield batches where each **X** has the shape **(32, input_steps, number_of_features)** and each **y** has the shape **(32, forecast_steps)**, where **input_steps** is the length of the input sequence and **number_of_features** is the number of features per timestep.

<br>

This batching mechanism is crucial for training neural networks efficiently, as it allows multiple data samples to be processed in parallel, reducing training time and enabling optimization techniques like mini-batch gradient descent.

# shuffle=True or False in time series data

When dealing with time series data in machine learning models, especially those designed to capture temporal dependencies like LSTM networks, the order of data points is crucial. Each data point in a time series is typically dependent on previous data points. Therefore, maintaining the chronological order of data is essential for the model to learn the underlying patterns effectively.

### Shuffle=False for Time Series

For time series datasets, it's generally recommended to set `shuffle=False` in the `DataLoader` for several reasons:

- **Preserve Temporal Order**: Time series forecasting relies on the sequence of data points being in their true chronological order. Setting `shuffle=False` ensures that batches of data fed into the model preserve this order, which is necessary for the model to learn from past observations to predict future values accurately.

- **Consistency Across Batches**: Keeping the data in sequence allows the model to potentially use information from the end of one batch to inform the beginning of the next, especially if there's overlap between the batches or if statefulness is considered in the model training.

- **Validation and Testing Integrity**: For validation and testing sets, it's particularly important to maintain the chronological order to accurately assess the model's performance on unseen data that follows the same temporal sequence as the training set.
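It can help to see exactly what the flag changes. The sketch below uses a hypothetical `WindowDataset` over a toy series of ten values. With `shuffle=False` the windows arrive in chronological order. With `shuffle=True` only the order in which windows are drawn changes, while the timesteps inside each window stay in their original order.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class WindowDataset(Dataset):
    """Hypothetical sliding-window dataset over a 1D series."""

    def __init__(self, series, window):
        self.series = series
        self.window = window

    def __len__(self):
        return len(self.series) - self.window + 1

    def __getitem__(self, idx):
        return self.series[idx : idx + self.window]


series = torch.arange(10, dtype=torch.float32)  # 0, 1, ..., 9
dataset = WindowDataset(series, window=3)       # 8 windows of length 3

# Concatenate all batches so the overall sample order is visible
ordered = torch.cat(list(DataLoader(dataset, batch_size=4, shuffle=False)))
shuffled = torch.cat(list(DataLoader(dataset, batch_size=4, shuffle=True)))

print(ordered[:, 0])   # window start values in chronological order
print(shuffled[:, 0])  # same start values, drawn in a random order

# Inside every window the values are still strictly increasing
print(bool((shuffled[:, 1:] > shuffled[:, :-1]).all()))  # True
```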

### When Shuffle=True Might Be Used

However, there are scenarios in time series analysis where `shuffle=True` might be useful, particularly during certain types of training:

- **Non-Sequential Models**: If the model being trained does not rely on the sequential nature of the data (for example, if you're using time series data in a way that treats each point as independent), shuffling can help to reduce variance and make the model more robust.

- **Cross-Validation**: In some types of cross-validation schemes designed specifically for time series (like time series split cross-validation), data is still handled in a way that respects the time series nature but involves selecting different segments of the data for training and validation to ensure the model's generalizability.
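As an illustration of a time-series-aware split, the sketch below uses scikit-learn's `TimeSeriesSplit` on a toy index of twelve time steps. The `timestamps` array is made up for the example. In every fold, the training indices come strictly before the validation indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

timestamps = np.arange(12)  # hypothetical observations, already in time order
splitter = TimeSeriesSplit(n_splits=3)

folds = list(splitter.split(timestamps))
for train_idx, val_idx in folds:
    # Each fold trains on an expanding prefix and validates on the block after it
    print("train:", train_idx, "validation:", val_idx)
```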

### Conclusion

For most time series forecasting tasks, especially when using models that rely on understanding the sequence of data (like LSTMs), it's best practice to set `shuffle=False` in the DataLoader. This approach respects the inherent order and dependencies within the data, which are vital for the model to make accurate predictions.

# Tips: Another example of DataLoader output

Given your specific setup (stock price data with 5 features, a sequence length of 7, a batch size of 10, and a PyTorch DataLoader), each batch yielded by the DataLoader is shaped around sequences of your data.

Here’s how the data structure can be conceptualized:

- **Input Data Tensor**: The shape of your input data tensor for each batch will reflect the batch size, the sequence length, and the number of features. Specifically, it will be `(batch_size, sequence_length, num_features)`, which translates to `(10, 7, 5)` in your case. This means each batch contains 10 sequences, each sequence is 7 time steps long, and each time step includes 5 features.

- **Labels Tensor**: The shape of the labels tensor depends on how you've structured your task (e.g., predicting the next value in the sequence, classifying the sequence, etc.). If you're predicting the next day's stock price or a similar single-value output, with one label per sequence, the labels tensor for each batch has the shape `(10,)`. If you're predicting multiple values or a more complex structure, the shape adjusts accordingly.

Here is a simplified example to illustrate this:

```python
from torch.utils.data import DataLoader, TensorDataset
import torch

# Simulating stock price data: 100 samples, each with 7 time steps and 5 features per step
inputs = torch.randn(100, 7, 5)  # Shape: [100, 7, 5]

# Assuming one target value per sequence for simplicity
# If your task involves predicting multiple future values or has a different structure, adjust accordingly
labels = torch.randn(100)  # Shape: [100]

# Create a TensorDataset and DataLoader
dataset = TensorDataset(inputs, labels)
dataloader = DataLoader(dataset, batch_size=10)

# Iterate over the DataLoader
for batch_idx, (data, target) in enumerate(dataloader):
    print(f"Batch {batch_idx}:")
    print(f" - Data shape: {data.shape}")  # Expected to be torch.Size([10, 7, 5])
    print(f" - Target shape: {target.shape}")  # Expected to be torch.Size([10])
```

In this scenario, each iteration over the `DataLoader` will give you a batch where:
- `data` is a tensor containing 10 sequences of stock price data, with each sequence being 7 days long and each day having 5 features, thus having a shape of `(10, 7, 5)`.
- `target` is a tensor containing 10 labels (one for each sequence in the batch), assuming a single target value per sequence. If your label structure is different (e.g., if you're predicting multiple future values for each sequence), the shape of `target` would vary accordingly.
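For the multi-value case mentioned above, a minimal variation of the same setup is sketched here. The `horizon` of 3 future values per sequence is an arbitrary assumption chosen for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

horizon = 3  # hypothetical number of future values predicted per sequence

inputs = torch.randn(100, 7, 5)     # 100 sequences, 7 time steps, 5 features
labels = torch.randn(100, horizon)  # 3 target values per sequence

loader = DataLoader(TensorDataset(inputs, labels), batch_size=10)
data, target = next(iter(loader))

print(data.shape)    # torch.Size([10, 7, 5])
print(target.shape)  # torch.Size([10, 3])
```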

# Check whether a pandas DataFrame can be used directly in a Dataset class
* The `DataLoader` is meant to yield `torch.Tensor` batches, so a `pd.DataFrame` has to be converted to a `torch.Tensor` (via NumPy) before it is returned from the dataset.
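The conversion itself is short. The sketch below uses a small made-up DataFrame (the column names `price` and `volume` are hypothetical) and goes from DataFrame to NumPy array to tensor.

```python
import numpy as np
import pandas as pd
import torch

# Made-up frame for illustration; mixed int/float columns are cast to float32
frame = pd.DataFrame({"price": [1.0, 2.0, 3.0], "volume": [10, 20, 30]})

array = frame.to_numpy(dtype=np.float32)  # DataFrame -> NumPy array
tensor = torch.from_numpy(array)          # NumPy array -> tensor

print(type(array).__name__, array.shape)  # ndarray (3, 2)
print(tensor.dtype, tensor.shape)         # torch.float32 torch.Size([3, 2])
```

`torch.from_numpy` shares memory with the NumPy array, while `torch.tensor(array)` makes a copy. Either is fine for read-only training data.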

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split, Dataset
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [None]:
# load data
data = pd.read_parquet('/content/drive/MyDrive/study_DeepLearning/論文実装/data/price_data_20240409.parquet')
# calc return
data['AGG_return'] = np.log(data['AGG'] / data['AGG'].shift(1))
data['DBC_return'] = np.log(data['DBC'] / data['DBC'].shift(1))
data['VTI_return'] = np.log(data['VTI'] / data['VTI'].shift(1))
data['^VIX_return'] = np.log(data['^VIX'] / data['^VIX'].shift(1))
# exclude nan
data = data.iloc[1:]

# Splitting into training and testing sets
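# .copy() makes these independent frames, so writing the scaled values
# back into them below does not act on a view of `data`
train_data = data.loc[data.index<'2021-01-01'].copy()
test_data = data.loc[data.index>='2021-01-01'].copy()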

# Initializing the MinMaxScaler
scaler = MinMaxScaler()

# Fitting the scaler to the training data and transforming training and testing data
train_scaled = scaler.fit_transform(train_data.iloc[:, :4])
test_scaled = scaler.transform(test_data.iloc[:, :4])

# Writing the scaled values back into the DataFrames
train_data.iloc[:, :4] = train_scaled
test_data.iloc[:, :4] = test_scaled


class FinancialDataset(Dataset):
    def __init__(self, data, sequence_length=3):
        self.features = data
        self.sequence_length = sequence_length

    def __len__(self):
        return len(self.features) - self.sequence_length

    def __getitem__(self, idx):
        # Rows idx .. idx+sequence_length-1, all 8 columns (4 scaled prices + 4 returns)
        sequence = self.features.iloc[idx:idx+self.sequence_length].to_numpy()
        target_return = sequence[:, 4:]  # the four return columns, used to calculate the Sharpe ratio
        return torch.tensor(sequence, dtype=torch.float32), torch.tensor(target_return, dtype=torch.float32)

train_dataset = FinancialDataset(train_data)
train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=False)

In [None]:
x_list = []
y_list = []
for x, y in train_dataloader:
    x_list.append(x)
    y_list.append(y)

In [None]:
x.shape

torch.Size([1, 3, 8])
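The leading dimension is 1 rather than 2 because `x` is the last batch left over from the loop, and the number of samples is not a multiple of `batch_size`. If a fixed batch size is needed, `DataLoader` accepts `drop_last=True`. The sketch below shows the difference on five hypothetical samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five made-up samples of shape (3, 8); 5 is not divisible by batch_size=2
dataset = TensorDataset(torch.randn(5, 3, 8))

keep = [x.shape[0] for (x,) in DataLoader(dataset, batch_size=2, shuffle=False)]
drop = [x.shape[0] for (x,) in DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)]

print(keep)  # [2, 2, 1]  -> final batch is smaller
print(drop)  # [2, 2]     -> incomplete final batch is discarded
```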

In [None]:
x_list[0]

tensor([[[ 0.3759,  0.3617,  0.1793,  0.0605, -0.0007, -0.0294, -0.0098,
           0.0413],
         [ 0.3743,  0.3589,  0.1821,  0.0502, -0.0005, -0.0043,  0.0072,
          -0.0575],
         [ 0.3762,  0.3651,  0.1813,  0.0541,  0.0006,  0.0094, -0.0019,
           0.0224]],

        [[ 0.3743,  0.3589,  0.1821,  0.0502, -0.0005, -0.0043,  0.0072,
          -0.0575],
         [ 0.3762,  0.3651,  0.1813,  0.0541,  0.0006,  0.0094, -0.0019,
           0.0224],
         [ 0.3695,  0.3531,  0.1821,  0.0507, -0.0021, -0.0184,  0.0021,
          -0.0192]]])

In [None]:
x_list[1]

tensor([[[ 0.3762,  0.3651,  0.1813,  0.0541,  0.0006,  0.0094, -0.0019,
           0.0224],
         [ 0.3695,  0.3531,  0.1821,  0.0507, -0.0021, -0.0184,  0.0021,
          -0.0192],
         [ 0.3718,  0.3431,  0.1803,  0.0572,  0.0007, -0.0156, -0.0045,
           0.0366]],

        [[ 0.3695,  0.3531,  0.1821,  0.0507, -0.0021, -0.0184,  0.0021,
          -0.0192],
         [ 0.3718,  0.3431,  0.1803,  0.0572,  0.0007, -0.0156, -0.0045,
           0.0366],
         [ 0.3666,  0.3378,  0.1840,  0.0423, -0.0016, -0.0084,  0.0094,
          -0.0860]]])

In [None]:
train_data.head(6)

| Date | AGG | DBC | VTI | ^VIX | AGG_return | DBC_return | VTI_return | ^VIX_return |
|---|---|---|---|---|---|---|---|---|
| 2006-02-07 | 0.375921 | 0.361714 | 0.179259 | 0.060503 | -0.000699 | -0.029352 | -0.009784 | 0.041313 |
| 2006-02-08 | 0.374319 | 0.358932 | 0.182055 | 0.05017 | -0.000499 | -0.004264 | 0.007169 | -0.057548 |
| 2006-02-09 | 0.376241 | 0.365053 | 0.18131 | 0.054113 | 0.000599 | 0.009358 | -0.001907 | 0.022352 |
| 2006-02-10 | 0.369517 | 0.353089 | 0.182118 | 0.050714 | -0.002099 | -0.018373 | 0.002065 | -0.019239 |
| 2006-02-13 | 0.371758 | 0.343072 | 0.180347 | 0.05724 | 0.0007 | -0.015646 | -0.004533 | 0.036617 |
| 2006-02-14 | 0.366635 | 0.337785 | 0.184044 | 0.042284 | -0.001601 | -0.008357 | 0.009441 | -0.08599 |


# Dataset and TensorDataset

In PyTorch, `Dataset` and `TensorDataset` are both used to handle data, but they serve different purposes and have different levels of abstraction:

### Dataset
`Dataset` is an abstract class representing a dataset in PyTorch. To create your own custom dataset, you typically inherit from this class and implement at least two methods:
- `__len__`: which returns the size of the dataset (the number of items it contains).
- `__getitem__`: which supports the indexing such that `dataset[i]` can be used to get the ith sample from the dataset.

This abstract class is designed to be a flexible base for creating datasets where you might need to read files, apply transformations, and perform other custom operations. It does not contain any specific data handling itself but defines a protocol for accessing elements and their number.

Here is an example of creating a custom dataset:
```python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, transforms=None):
        self.data = data
        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transforms:
            sample = self.transforms(sample)
        return sample
```

### TensorDataset
`TensorDataset` is a specific implementation of the `Dataset` abstract class that wraps tensors. When initialized with one or more tensors, it provides a way to access slices of these tensors as samples. This is particularly useful when your dataset is already in memory, in the form of multiple large tensors, but you want to access it through the convenient Dataset interface used by PyTorch's `DataLoader` for batching, shuffling, etc.

`TensorDataset` requires that all tensors you pass in have the same size along the first dimension, and it indexes every tensor along that dimension to form a sample: `dataset[i]` returns a tuple containing the `i`-th slice of each tensor. This is ideal for situations where you have features and targets already loaded in tensors and you want to iterate through them efficiently.

Here is an example of using `TensorDataset`:
```python
from torch.utils.data import TensorDataset, DataLoader
import torch

# Example features and labels
features = torch.randn(100, 10)  # 100 samples, 10 features each
labels = torch.randn(100, 1)  # 100 samples, 1 target each

# Create TensorDataset
dataset = TensorDataset(features, labels)

# Create DataLoader for batching
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)

# Usage example
for batch_features, batch_labels in dataloader:
    # Process batches here
    pass
```
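To see what a single sample looks like, the following short sketch indexes a `TensorDataset` directly, using the same shapes as the example above. Each sample is a tuple holding one row from every wrapped tensor.

```python
import torch
from torch.utils.data import TensorDataset

features = torch.randn(100, 10)  # 100 samples, 10 features each
labels = torch.randn(100, 1)     # 100 samples, 1 target each
dataset = TensorDataset(features, labels)

# Indexing returns a tuple with one entry per wrapped tensor
sample_features, sample_labels = dataset[3]

print(len(dataset))                               # 100
print(sample_features.shape)                      # torch.Size([10])
print(sample_labels.shape)                        # torch.Size([1])
print(torch.equal(sample_features, features[3]))  # True
```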

### Summary
- **Dataset**: An abstract class for making datasets available in PyTorch in a flexible and customizable way, requiring you to define how items are accessed.
- **TensorDataset**: A convenient class for handling datasets already loaded into tensors, allowing you to use them directly in the training pipeline with minimal setup.

Using `Dataset` allows for greater flexibility and is useful for handling complex data-loading scenarios, while `TensorDataset` is straightforward and efficient for datasets already in tensor form.