In [1]:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import accuracy_score
import lightgbm as lgb

# Downloading the data
ticker_symbol = "CL=F"
oil_data = yf.download(ticker_symbol, start="2020-01-01", end="2023-01-01")

# 1. Prepare the Data
# Calculate daily returns
oil_data['Return'] = oil_data['Close'].pct_change()
# Create the binary target variable
oil_data['Y'] = (oil_data['Return'] >= 0).astype(int)

# Feature Engineering: using OHLC and Volume as features
features = ['Open', 'High', 'Low', 'Close', 'Volume']
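# Note: feeding raw OHLC and volume values to a model is not good practice.
# Ratios (e.g., relative changes) are preferable to raw levels to reduce overfitting.
# The features are also same-day values, so a real forecast would need to lag them.
# The aim here is only to provide sample code showing how TimeSeriesSplit works.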
X = oil_data[features]
y = oil_data['Y']

# Drop the first row as it will have NaN for return
X = X.iloc[1:]
y = y.iloc[1:]

# 2. Time Series Split
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # 3. Model Training and Tuning
    # Define the model
    model = lgb.LGBMClassifier()
    
    # Define the parameter grid
    param_grid = {
        'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5]
    }
    
    # GridSearchCV with a time-ordered inner splitter for hyperparameter tuning
    gsearch = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=TimeSeriesSplit(n_splits=3),
        scoring='accuracy',
    )
    gsearch.fit(X_train, y_train)
    
    # Best model
    best_model = gsearch.best_estimator_
    
    # 4. Model Evaluation
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Note: the label printed here is the last training-row index, not the fold number
    print(f'Accuracy for fold {train_index[-1]}: {accuracy:.2f}')



Accuracy for fold 129: 0.56
Accuracy for fold 254: 0.42
Accuracy for fold 379: 0.66
Accuracy for fold 504: 0.58
Accuracy for fold 629: 0.70
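
The comment in the code above recommends ratios over raw OHLC and volume values. A minimal sketch of what that could look like follows. It runs on a synthetic price frame rather than the downloaded data, and the helper name `make_ratio_features` and the three feature names are assumptions chosen for illustration, not part of the original notebook. Features are shifted by one row so that each day's inputs use only information available the day before.

```python
import numpy as np
import pandas as pd

# Synthetic OHLCV frame standing in for the downloaded data (illustrative only).
rng = np.random.default_rng(0)
n = 60
close = 50 + np.cumsum(rng.normal(0, 1, n))
demo = pd.DataFrame(
    {
        "Open": close + rng.normal(0, 0.5, n),
        "High": close + np.abs(rng.normal(0, 1, n)) + 0.5,
        "Low": close - np.abs(rng.normal(0, 1, n)) - 0.5,
        "Close": close,
        "Volume": rng.integers(1_000, 5_000, n).astype(float),
    },
    index=pd.bdate_range("2022-01-03", periods=n),
)


def make_ratio_features(df):
    """Hypothetical helper: scale-free features from OHLCV, lagged by one day."""
    feats = pd.DataFrame(index=df.index)
    feats["range_pct"] = (df["High"] - df["Low"]) / df["Close"]
    feats["body_pct"] = (df["Close"] - df["Open"]) / df["Open"]
    feats["volume_chg"] = df["Volume"].pct_change()
    # Shift by one row so day t's features only use data through day t-1.
    return feats.shift(1)


features_demo = make_ratio_features(demo)
target_demo = (demo["Close"].pct_change() >= 0).astype(int)

# Drop rows made NaN by pct_change and shift, keeping X and y aligned.
data = features_demo.join(target_demo.rename("Y")).dropna()
X_demo, y_demo = data.drop(columns="Y"), data["Y"]
print(X_demo.shape, y_demo.shape)
```

`X_demo` and `y_demo` could then be passed to the same `TimeSeriesSplit` loop as above in place of `X` and `y`.
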


# Differences between Time Series Split and 5-Fold Cross-Validation

How a model is validated matters for time series because observations are ordered in time and depend on earlier observations. This makes validation fundamentally different from the cross-sectional case. This section compares Time Series Split with traditional K-Fold Cross-Validation, using 5-Fold Cross-Validation as the reference.

## 5-Fold Cross-Validation (K-Fold Cross-Validation)

In K-Fold Cross-Validation, the data is divided into 'K' folds, usually after shuffling (scikit-learn's `KFold` shuffles only when `shuffle=True`). For each fold 'i':
- The fold 'i' is used as the validation set.
- The remaining 'K-1' folds are used as the training set.
- The model is trained on the 'K-1' folds and evaluated on the 'i-th' fold.
- This process is repeated 'K' times, each time with a different fold used as the validation set.

### Characteristics:
- Each data point gets to be in a validation set exactly once, and gets to be in a training set 'K-1' times.
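- The temporal order of the data is not respected: with shuffling the folds are random, and even without shuffling most training sets contain observations that come after the validation fold.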

![5-Fold Cross-Validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

_Image Source: scikit-learn.org_
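
The characteristics above can be checked on a toy example. This is a minimal sketch using scikit-learn's `KFold` on ten dummy observations; the array `X_toy` and the counter names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 observations, indexed in time order

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
validation_counts = np.zeros(len(X_toy), dtype=int)
training_counts = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy), start=1):
    # Track how often each observation lands in each role.
    validation_counts[val_idx] += 1
    training_counts[train_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")

print("times in validation:", validation_counts.tolist())
print("times in training:  ", training_counts.tolist())
```

Every observation should appear once in a validation set and four times (K-1) in a training set, while the printed training indices typically include positions later than the validation ones.
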

## Time Series Split

Time Series Split is a variation of cross-validation designed for time-ordered data. In Time Series Split:
- The dataset is cut into consecutive blocks in time order. scikit-learn's `TimeSeriesSplit(n_splits=K)` produces 'K' train/validation splits, and the training set grows with each split.
- Unlike in K-Fold Cross-Validation, the validation set for each fold consists of data points that come after all the data points in the training set, preserving the temporal order of observations.

### Characteristics:
- Ensures that the training set always precedes the validation set.
- Prevents the model from seeing future data during training.
- More suitable for time series data where temporal ordering matters.

![Time Series Split](https://miro.medium.com/max/1400/1*2-zaRQ-dsv8KWxOlzc8VaA.png)

_Image Source: Towards Data Science (miro.medium.com)_
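
To make the expanding-window behaviour concrete, here is a small sketch with scikit-learn's `TimeSeriesSplit` on twelve dummy observations. The array `X_toy` and the list `splits` are illustrative names only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 observations, indexed in time order

tscv_toy = TimeSeriesSplit(n_splits=5)
splits = [(tr.tolist(), te.tolist()) for tr, te in tscv_toy.split(X_toy)]

for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx} validation={val_idx}")

# fold 1: train=[0, 1] validation=[2, 3]
# fold 2: train=[0, 1, 2, 3] validation=[4, 5]
# fold 3: train=[0, 1, 2, 3, 4, 5] validation=[6, 7]
# fold 4: train=[0, 1, 2, 3, 4, 5, 6, 7] validation=[8, 9]
# fold 5: train=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] validation=[10, 11]
```

In every fold the training indices end before the validation indices begin, and the training window grows from one split to the next.
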

## Key Differences

1. **Randomization**: 
   - 5-Fold Cross-Validation assigns observations to folds without regard to time (randomly, when shuffling is enabled).
   - Time Series Split preserves the order of data, respecting the time component.

2. **Data Leakage**:
   - 5-Fold Cross-Validation can cause data leakage on time series because the model may be trained on observations that come after the ones it is validated on.
   - Time Series Split avoids this kind of leakage by training only on past data to predict future data. It does not protect against leakage built into the features themselves.

3. **Model Evaluation**:
   - In time series forecasting, model performance can vary significantly depending on the period being predicted. Time Series Split allows the model's performance to be evaluated on different periods, reflecting more realistic forecasting scenarios.
   - 5-Fold Cross-Validation evaluates the model on random subsets of data, which may not provide an accurate assessment of the model's forecasting ability over time.
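
The leakage difference can be quantified with a short check. The sketch below counts, for each splitter, the folds in which the training set contains at least one observation later than the start of the validation set. The helper `count_folds_with_future_training_data` is a hypothetical name introduced here for illustration, and `KFold` is used without shuffling to keep the result deterministic.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit


def count_folds_with_future_training_data(splitter, n_samples):
    """Hypothetical helper: number of folds whose training data postdates validation."""
    X = np.zeros((n_samples, 1))  # only the row positions matter here
    leaky = 0
    for train_idx, val_idx in splitter.split(X):
        # A training index beyond the first validation index means the model
        # is fit on observations from the "future" of that validation fold.
        if train_idx.max() > val_idx.min():
            leaky += 1
    return leaky


kfold_leaky = count_folds_with_future_training_data(KFold(n_splits=5), 20)
ts_leaky = count_folds_with_future_training_data(TimeSeriesSplit(n_splits=5), 20)

print(f"KFold folds trained on future data:           {kfold_leaky} of 5")
print(f"TimeSeriesSplit folds trained on future data: {ts_leaky} of 5")
# KFold folds trained on future data:           4 of 5
# TimeSeriesSplit folds trained on future data: 0 of 5
```

Even without shuffling, only the last K-Fold split keeps all training data before the validation block.
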

## Conclusion

Choosing the right cross-validation technique is crucial in time series analysis to avoid data leakage and ensure that the model is evaluated in a manner that reflects its practical use. Time Series Split is typically preferred over K-Fold Cross-Validation for time series data due to its ability to handle the temporal structure of the data.
