# Lesson Overview

Welcome to today's lesson on addressing data leakage in time series data while preparing it for machine learning. In this lesson, you'll learn the importance of maintaining temporal order in your dataset splits to avoid forward-looking bias, which can misleadingly inflate your model's performance. We'll be using the Tesla ($TSLA) stock data as an example. By the end of this lesson, you'll understand how to partition your dataset correctly using `TimeSeriesSplit` from the `sklearn.model_selection` library.

## Introduction to Data Leakage in Time Series

Data leakage occurs when information from outside the training dataset inadvertently makes its way into the model. This is particularly problematic in time series data, where the natural temporal ordering is crucial. Data leakage can lead to an overestimation of a model's performance because it allows information from the future to be used in making predictions about the past.

When dealing with stock market data, using future prices to predict past prices would artificially inflate a model's accuracy and yield unreliable predictions for actual trading strategies. Hence, it's important to ensure that our training and testing sets are separated in a way that respects the temporal nature of the data.
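To make the risk concrete, here is a minimal sketch that compares a shuffled split with a chronological one. It assumes a hypothetical 20-row series represented only by its row numbers (`time_index`) and uses `train_test_split`, which the lesson itself does not use; it is there purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a price series: 20 time-ordered row numbers.
time_index = np.arange(20)

# A shuffled split can place later rows in the training set.
train_shuffled, test_shuffled = train_test_split(
    time_index, test_size=0.25, shuffle=True, random_state=0
)

# A chronological split keeps every training row before every test row.
train_ordered, test_ordered = train_test_split(
    time_index, test_size=0.25, shuffle=False
)

# True means at least one training row comes after some test row.
leaks_shuffled = bool(train_shuffled.max() > test_shuffled.min())
leaks_ordered = bool(train_ordered.max() > test_ordered.min())

print(f"Shuffled split trains on future rows: {leaks_shuffled}")
print(f"Ordered split trains on future rows:  {leaks_ordered}")
```

With `shuffle=False`, the last quarter of the rows becomes the test set, so the ordered split never trains on a row that comes after a test row.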

## Revisiting Feature Engineering and Scaling (Revision)

Let's quickly revise how to engineer features and scale them. These steps are foundational for preparing your data for machine learning models.

```python
import pandas as pd
import datasets
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = datasets.load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(data['train'])

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
```

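In this snippet, we create two new features, `High-Low` and `Price-Open`, and scale the feature matrix with `StandardScaler`. Note that the scaler here is fitted on the full dataset, which is fine for a quick revision but lets statistics from later rows influence earlier ones; a leak-free workflow fits the scaler on the training rows of each split only.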

## Correctly Splitting Time Series Data

To avoid data leakage in time series, we need to split our data so that future data points are not used to predict past data points. `TimeSeriesSplit` from the `sklearn.model_selection` library helps achieve this.

The `TimeSeriesSplit` class creates train/test splits that respect the temporal order of your data. Its key argument, `n_splits`, sets how many train/test splits are generated. No shuffling takes place: each successive split trains on a longer prefix of the data and tests on the block of rows that immediately follows it.
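As a small illustration of how `n_splits` shapes the folds, the sketch below runs `TimeSeriesSplit` on a hypothetical 10-row array (`X_demo`, not part of the Tesla dataset) and records the indices of each fold.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: 10 time-ordered rows with one feature.
X_demo = np.arange(10).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)

# Collect (train indices, test indices) for each fold as plain lists.
splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in tscv_demo.split(X_demo)
]

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx} test={test_idx}")
```

The training window grows with every fold, while each test block has the same size and sits directly after its training window.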

```python
from sklearn.model_selection import TimeSeriesSplit

# Initiate TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)

# Splitting with TimeSeriesSplit
for fold, (train_index, test_index) in enumerate(tscv.split(features_scaled)):
    print(f"Fold {fold + 1}")
    print(f"TRAIN indices (first 5): {train_index[:5]}, TEST indices (first 5): {test_index[:5]}")
    
    # Splitting the features and target
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    # Print a small sample of the data
    print(f"X_train sample:\n {X_train[:2]}")
    print(f"y_train sample:\n {y_train[:2]}")
    print(f"X_test sample:\n {X_test[:2]}")
    print(f"y_test sample:\n {y_test[:2]}")
    print("-" * 10)
```

To elaborate, `TimeSeriesSplit` generates indices for multiple train/test splits, where the training set for each split consists of all data points up to a specific point in time, and the test set includes the subsequent data points in time. This sequential process respects the chronological order of the data. As a result, no future data points are included in the training set of any fold, which effectively prevents data leakage. This method ensures that our model training and evaluation simulate real-world scenarios more accurately, thereby providing reliable performance metrics.

## Analyzing Fold Results

Let's analyze the output from each fold to ensure correct data splitting. The output of the above code will be:

```text
Fold 1
TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [839 840 841 842 843]
X_train sample:
 [[-0.48165383  0.08560547  2.29693712]
 [-0.48579183 -0.02912844  2.00292929]]
y_train sample:
 [1.592667 1.588667]
X_test sample:
 [[-0.4714307  -0.11890593  0.26304787]
 [-0.42092366  0.03234206  1.43036618]]
y_test sample:
 [10.857333 10.964667]
----------
Fold 2
TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [1675 1676 1677 1678 1679]
X_train sample:
 [[-0.48165383  0.08560547  2.29693712]
 [-0.48579183 -0.02912844  2.00292929]]
y_train sample:
 [1.592667 1.588667]
X_test sample:
 [[-0.46169462 -0.13046308  1.57995793]
 [-0.47447336  0.07639316  0.32446706]]
y_test sample:
 [17.066    17.133333]
----------
Fold 3
TRAIN indices (first 5): [0 1 2 3 4], TEST indices (first 5): [2511 2512 2513 2514 2515]
X_train sample:
 [[-0.48165383  0.08560547  2.29693712]
 [-0.48579183 -0.02912844  2.00292929]]
y_train sample:
 [1.592667 1.588667]
X_test sample:
 [[-0.27268857 -0.19528365  0.41906266]
 [-0.34291165 -0.09059793 -0.01236106]]
y_test sample:
 [66.726669 66.288002]
----------
```

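This output confirms the correct operation of `TimeSeriesSplit`. In every fold the training indices start at 0, while the test indices begin later (at 839, 1675, and 2511), so each test block follows its training block in time. The training sets grow from fold to fold and therefore overlap with one another, but within a fold the training and test sets never overlap. This ensures that no future data is used when training the model.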

## Best Practices in Time Series Data Splitting

Summarizing the key points:

- Always maintain temporal order when splitting time series data.
- Use `TimeSeriesSplit` to avoid data leakage.
- Verify the indices to ensure no future data is used in training.

Adhering to these practices ensures the reliability of your model's performance metrics and the validity of your predictions for real-world scenarios.
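The index check from the list above can be automated. The helper below is a hypothetical sketch (`check_temporal_order` is not a scikit-learn function): it reports whether every training index precedes every test index in each fold. `KFold` appears only as a contrasting example.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit


def check_temporal_order(splitter, n_samples):
    """Return True if, in every fold, all training indices precede all test indices."""
    placeholder = np.zeros((n_samples, 1))  # only the row count matters here
    return all(
        train_idx.max() < test_idx.min()
        for train_idx, test_idx in splitter.split(placeholder)
    )


ordered_ok = check_temporal_order(TimeSeriesSplit(n_splits=3), n_samples=40)
kfold_ok = check_temporal_order(KFold(n_splits=3), n_samples=40)

print(f"TimeSeriesSplit keeps temporal order: {ordered_ok}")
print(f"KFold keeps temporal order:           {kfold_ok}")
```

Plain `KFold` fails the check even without shuffling, because its first fold tests on the earliest rows and trains on everything after them.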

## Lesson Summary

In this lesson, you learned the importance of avoiding data leakage in time series datasets by using `TimeSeriesSplit`. With this method, you can maintain the integrity of your machine learning models and ensure their predictions are trustworthy for real-world financial trading tasks. Practicing these concepts and techniques will solidify your understanding and prepare you for more advanced machine-learning challenges.


## Adjusting TimeSeriesSplit to 5 Splits

The issue in the original code is the `enumerate(features_scaled)` call. Iterating over `features_scaled` yields one feature row at a time, so `enumerate` produces `(row_number, row)` pairs, and a three-value row cannot be unpacked into `(train_index, test_index)`. No splitting happens at all.

To fix this, iterate over `tscv.split(features_scaled)` instead of `features_scaled`. Here's the corrected code:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from datasets import load_dataset

# Load the Tesla stock dataset
tesla_dataset = load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(tesla_dataset['train'])

# Feature Engineering: Creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low','Price-Open','Volume']].values
target = tesla_df['Close'].values

# Scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Correctly Splitting Time Series Data
tscv = TimeSeriesSplit(n_splits=3)

# Splitting with TimeSeriesSplit and printing results
for fold, (train_index, test_index) in enumerate(tscv.split(features_scaled)):
    print(f"Fold {fold + 1}")
    print(f"TRAIN indices (first 5): {train_index[:5]}, TEST indices (first 5): {test_index[:5]}")
    
    # Splitting the features and target
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    # Print a small sample of the data
    print(f"X_train sample:\n {X_train[:2]}")
    print(f"y_train sample:\n {y_train[:2]}")
    print(f"X_test sample:\n {X_test[:2]}")
    print(f"y_test sample:\n {y_test[:2]}")
    print("-" * 10)
```

### Explanation:
- **TimeSeriesSplit**: This class from `sklearn.model_selection` is designed specifically for splitting time series data into training and test sets.
- **enumerate(tscv.split(features_scaled))**: This correctly iterates over the splits generated by `TimeSeriesSplit`, providing `fold` (the fold number) and `(train_index, test_index)` (indices for training and test sets).

By using `tscv.split(features_scaled)`, you ensure that the data splitting respects the temporal order of your time series data, which is crucial for time series analysis and modeling.
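To see why the original loop fails, the sketch below reproduces the mistake on a hypothetical 6x3 array (`features_demo`) and then runs the corrected loop. The exact wording of the error message may vary between Python versions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the scaled feature matrix: 6 rows, 3 columns.
features_demo = np.arange(18, dtype=float).reshape(6, 3)

error_message = None
try:
    # Buggy loop: enumerate yields (row_number, row), and a 3-value row
    # cannot be unpacked into (train_index, test_index).
    for fold, (train_index, test_index) in enumerate(features_demo):
        pass
except ValueError as err:
    error_message = str(err)
    print(f"Buggy loop raised ValueError: {error_message}")

# Corrected loop: iterate over the splits produced by TimeSeriesSplit.
tscv_demo = TimeSeriesSplit(n_splits=2)
fold_sizes = []
for fold, (train_index, test_index) in enumerate(tscv_demo.split(features_demo)):
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"Fold {fold + 1}: {len(train_index)} train rows, {len(test_index)} test rows")
```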

## Fixing Time Series Data Split

Here's the completed code with the missing parts filled in, so that the temporal order is maintained and the scaler is fitted separately for each fold:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from datasets import load_dataset

# Load Tesla stock data
dataset = load_dataset('codesignal/tsla-historic-prices')
tesla_df = pd.DataFrame(dataset['train'])

# Feature Engineering
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Define features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Create a TimeSeriesSplit instance
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_index, test_index) in enumerate(tscv.split(features)):
    print(f"Fold {fold + 1}")
    print(f"TRAIN indices (first & last 5): {train_index[:5]}, {train_index[-5:]}")
    print(f"TEST indices (first 5): {test_index[:5]}")
    
    # Splitting the features and target
    X_train, X_test = features[train_index], features[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    # Create a scaler for each fold
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    # Use the fitted scaler to transform the test data
    X_test_scaled = scaler.transform(X_test)
    
    # Print a small sample of the data
    print(f"X_train sample:\n {X_train_scaled[:2]}")
    print(f"y_train sample:\n {y_train[:2]}")
    print(f"X_test sample:\n {X_test_scaled[:2]}")
    print(f"y_test sample:\n {y_test[:2]}")
    print("-" * 10)
```

### Explanation:
1. **TimeSeriesSplit instance (`tscv`)**: This is initialized with `n_splits=3` to split the data into 3 folds while preserving the temporal order.
2. **For loop over `tscv.split(features)`**: The code iterates over each fold, retrieving `train_index` and `test_index`, which are used to split the features and target into training and test sets.
3. **Scaler for each fold**: A `StandardScaler` is created and fitted to the training data only (`X_train`) to prevent data leakage. The training data is then scaled, and the same scaler is used to transform the test data (`X_test`).
4. **Output**: The code prints the first two rows of the scaled training and test sets, along with the corresponding target values, for a quick check.

This code ensures that the data scaling is done properly without data leakage, which is critical for time series modeling.
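One way to get the same per-fold scaling without writing the loop by hand is to wrap the scaler and a model in a scikit-learn pipeline and pass `TimeSeriesSplit` as the cross-validation strategy. The sketch below assumes synthetic data (`X_demo`, `y_demo`) and a `LinearRegression` model, neither of which comes from the lesson; treat it as an illustration rather than the lesson's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for time-ordered features and a target (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# The pipeline refits the scaler on each fold's training rows only,
# then applies that fitted scaler to the fold's test rows.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

scores = cross_val_score(
    pipeline,
    X_demo,
    y_demo,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)

print(f"Negative MAE per fold: {scores}")
```

Because the scaler lives inside the pipeline, it never sees a fold's test rows during fitting, which mirrors the manual `fit_transform` / `transform` pattern above.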

## Ensure Proper Scaling in Time Series Splitting

Here's the completed code with the blanks filled in for feature engineering, scaling, and splitting the data while respecting the temporal order:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from datasets import load_dataset

# Load the dataset
tesla_df = load_dataset('codesignal/tsla-historic-prices', split='train').to_pandas()

# Feature Engineering: creating new features
tesla_df['High-Low'] = tesla_df['High'] - tesla_df['Low']
tesla_df['Price-Open'] = tesla_df['Close'] - tesla_df['Open']

# Defining features and target
features = tesla_df[['High-Low', 'Price-Open', 'Volume']].values
target = tesla_df['Close'].values

# Scale the features using StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Initiate TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=3)

# Splitting with TimeSeriesSplit
for fold, (train_index, test_index) in enumerate(tscv.split(features_scaled)):
    print(f"Fold {fold + 1}")
    print(f"TRAIN indices (first 5): {train_index[:5]}, TEST indices (first 5): {test_index[:5]}")
    
    # Split into features and target
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    # Print a small sample of the data
    print(f"X_train sample:\n {X_train[:2]}")
    print(f"y_train sample:\n {y_train[:2]}")
    print(f"X_test sample:\n {X_test[:2]}")
    print(f"y_test sample:\n {y_test[:2]}")
    print("-" * 10)
```

### Explanation:
1. **Scaling the Features**: The `StandardScaler` standardizes the features by removing the mean and scaling to unit variance, which puts all features on a comparable scale. Here the scaler is fitted once on the full dataset before splitting. This is simple, but it means the mean and variance are computed using rows that later end up in the test sets. For a leak-free evaluation, fit the scaler on each fold's training rows only, as in the "Fixing Time Series Data Split" section above.
   
2. **Splitting the Data**: The code uses `TimeSeriesSplit` to maintain the temporal order. Within each fold, the already-scaled features and the target are split into training and test sets.

3. **Print Statements**: These help in verifying that the splits are being done correctly and that the scaling is applied as expected.

This code respects the sequential nature of financial time series data when splitting, while the scaling step remains a simplification.
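To illustrate how much the fitted statistics can depend on where the scaler is fitted, the sketch below compares a scaler fitted on a whole series with one fitted only on the first fold's training rows. It assumes a hypothetical, steadily increasing feature (`trend`) as a stand-in for a trending price.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# Hypothetical upward-trending feature: 100 time-ordered rows.
trend = np.arange(100, dtype=float).reshape(-1, 1)

# Take the first fold produced by TimeSeriesSplit.
train_idx, test_idx = next(iter(TimeSeriesSplit(n_splits=3).split(trend)))

# Scaler fitted on all rows, including rows that belong to later test sets.
global_mean = StandardScaler().fit(trend).mean_[0]

# Scaler fitted on the first fold's training rows only.
fold_mean = StandardScaler().fit(trend[train_idx]).mean_[0]

print(f"Mean learned from all rows:           {global_mean}")
print(f"Mean learned from fold-1 training set: {fold_mean}")
```

On trending data the two means differ substantially, which shows that a scaler fitted on the full dataset carries information about later rows into the training features.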

## Addressing Data Leakage in Time Series

Let's put everything together. The code below loads the Tesla stock data, engineers the necessary features, scales them, and uses `TimeSeriesSplit` to split the data in temporal order. It also prints the indices and samples for each fold.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from datasets import load_dataset

# Load the dataset 'codesignal/tsla-historic-prices' using `datasets` library and convert to a DataFrame
tesla_df = load_dataset('codesignal/tsla-historic-prices', split='train').to_pandas()

# Engineer new features: 'Moving_Average_10' and 'Returns'
# 'Moving_Average_10' is a 10-day Simple Moving Average
tesla_df['Moving_Average_10'] = tesla_df['Close'].rolling(window=10).mean()

# 'Returns' is the difference between the closing price and the open price
tesla_df['Returns'] = tesla_df['Close'] - tesla_df['Open']

# Drop NaN values created by the rolling mean
tesla_df.dropna(inplace=True)

# Define features array with 'Moving_Average_10', 'Returns', and 'Volume', and the target array as 'Close'
features = tesla_df[['Moving_Average_10', 'Returns', 'Volume']].values
target = tesla_df['Close'].values

# Scale the features using StandardScaler
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

# Initiate TimeSeriesSplit with 3 splits
tscv = TimeSeriesSplit(n_splits=3)

# Print indices and samples for each fold
for fold, (train_index, test_index) in enumerate(tscv.split(features_scaled)):
    print(f"Fold {fold + 1}")
    print(f"TRAIN indices (first 5): {train_index[:5]}, (last 5): {train_index[-5:]}")
    print(f"TEST indices (first 5): {test_index[:5]}, (last 5): {test_index[-5:]}")
    
    # Split into features and target
    X_train, X_test = features_scaled[train_index], features_scaled[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    # Print a small sample of the data
    print(f"X_train sample:\n {X_train[:2]}")
    print(f"y_train sample:\n {y_train[:2]}")
    print(f"X_test sample:\n {X_test[:2]}")
    print(f"y_test sample:\n {y_test[:2]}")
    print("-" * 10)
```

### Explanation:

1. **Loading the Dataset**: The Tesla stock dataset is loaded using the `datasets` library, and it's converted into a Pandas DataFrame for easier manipulation.

2. **Feature Engineering**:
   - `Moving_Average_10`: A 10-day simple moving average of the closing prices.
   - `Returns`: The difference between the closing and opening prices for the day.
   - We drop any NaN values generated due to the moving average calculation.

3. **Defining Features and Target**:
   - The features are selected as `Moving_Average_10`, `Returns`, and `Volume`.
   - The target is set as the `Close` price.

4. **Scaling**:
   - The features are scaled using `StandardScaler` to standardize the data.

5. **TimeSeriesSplit**:
   - `TimeSeriesSplit` is used with 3 splits to ensure that the temporal order is maintained.
   - For each fold, the code prints the first and last 5 indices of the training and test sets, along with samples from the scaled features and corresponding targets.

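This code splits the time series in temporal order, so no fold trains on rows that come after its test rows. Note that the scaler is fitted once on the full dataset rather than per fold; for a stricter setup, fit it on each fold's training rows only, as in the "Fixing Time Series Data Split" section. Keeping the split chronological is crucial for building reliable models on time series data, especially in financial analysis.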
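The NaN handling in the feature engineering step can be seen on a tiny example. The sketch below assumes a hypothetical five-row `Close` series and a 3-row window instead of the 10-day window used above, so the effect is easy to read.

```python
import pandas as pd

# Hypothetical closing prices (not taken from the Tesla dataset).
demo_df = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# A rolling mean needs a full window, so the first window - 1 rows are NaN.
demo_df["Moving_Average_3"] = demo_df["Close"].rolling(window=3).mean()
n_missing = int(demo_df["Moving_Average_3"].isna().sum())

# Drop the incomplete rows and renumber so positions match the NumPy arrays.
demo_clean = demo_df.dropna().reset_index(drop=True)

print(demo_df)
print(f"Rows with an incomplete window: {n_missing}")
print(demo_clean)
```

With a window of 10, as in the code above, the first nine rows are dropped in the same way before the features and target are extracted.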