# Transformer-Based Time Series Forecasting: Stock Price Prediction with Keras and TensorFlow

## 1. Introduction
Welcome! This guide walks through implementing a Transformer encoder for financial time series forecasting. Unlike traditional RNNs, which process sequences step by step, Transformers use self-attention to relate every time step to every other one directly, which helps capture long-term dependencies and complex patterns in time series data.

The script (`main.py`) performs several key operations:
1. Generates synthetic stock price data (a linear trend plus Gaussian noise) as a stand-in for real market data.
2. Implements a Transformer encoder from scratch using Keras with TensorFlow.
3. Trains the model to forecast future stock prices using multi-head self-attention.
4. Visualizes the training process and model architecture using TensorBoard and Visualkeras.
5. Evaluates the model's performance and makes predictions on time series data.

This implementation provides a solid foundation for real-world time series forecasting applications, from financial markets to demand planning and sensor data analysis.

## 📺 Watch the Tutorial

Prefer a video walkthrough? Check out the accompanying tutorial on YouTube:

[Transformer Time Series Forecasting Tutorial](https://youtu.be/LNydD9ZemZ8)

## 🚀 Quick Start Guide

Ready to run the code? Follow these simple steps to set up your environment and execute the Transformer time series forecasting model:

### Step 1: Create a Python Virtual Environment
```bash
# Create a new virtual environment
python -m venv transformer_timeseries

# Activate the virtual environment
# On Windows:
transformer_timeseries\Scripts\activate

# On macOS/Linux:
source transformer_timeseries/bin/activate
```

### Step 2: Install Required Dependencies
```bash
# Install all required packages
pip install -r requirements.txt
```

The `requirements.txt` file includes:
- `tensorflow>=2.8.0` - Deep learning framework
- `numpy` - Numerical computing
- `matplotlib` - Plotting and visualization
- `visualkeras` - Model architecture visualization
- `pandas` - Data manipulation
- `scikit-learn` - Machine learning utilities

### Step 3: Run the Main Script
```bash
# Execute the Transformer time series forecasting model
python main.py
```

### What Happens When You Run It?
1. **Data Generation:** Creates synthetic stock price data (a linear trend plus Gaussian noise)
2. **Model Building:** Constructs the Transformer architecture with multi-head attention
3. **Training:** Trains the model for 20 epochs with TensorBoard logging
4. **Visualization:** Generates model architecture diagrams and training plots
5. **Evaluation:** Provides performance metrics (MSE, MAE, RMSE)
6. **Prediction:** Creates forecasts and saves comparison plots

### Monitoring Training Progress
The script automatically sets up TensorBoard logging. You can monitor training while the script runs, or inspect the logs afterwards:
```bash
# Launch TensorBoard (the script will show you the exact command)
tensorboard --logdir logs/[timestamp]
```

Then open your browser to `http://localhost:6006` to view:
- Training/validation loss curves
- Model graph
- Performance metrics over time
- Weight histograms

**Expected Runtime:** Approximately 5-10 minutes on a modern CPU, faster with GPU acceleration.

### Architecture Overview
The complete workflow of our Transformer-based time series forecasting system:

![Transformer Time Series Architecture](time_series_transformers.png)

This diagram illustrates the end-to-end pipeline from raw time series data through the Transformer architecture to final predictions.

## 2. Core Concepts: Why Transformers for Time Series?

### Traditional Approaches vs. Transformers
Traditionally, time series forecasting relied on:
1. **Statistical methods** like ARIMA models or exponential smoothing - work well for simple patterns but struggle with complex, non-linear relationships.
2. **Recurrent Neural Networks** like LSTMs or GRUs - process sequences step by step, building context as they move through time.

**The fundamental limitation:** RNNs process information sequentially, meaning they can only look at past information when processing the current time step, and they often struggle with very long-term dependencies due to the vanishing gradient problem.

### The Transformer Revolution
Transformers completely revolutionize this approach:
- **Parallel Processing:** Instead of processing sequences step by step, Transformers can look at all time steps simultaneously through self-attention.
- **Long-Range Dependencies:** They can capture relationships between time steps that are very far apart - crucial for financial forecasting where events from weeks or months ago can influence current prices.
- **Multiple Perspectives:** Multi-head attention allows the model to focus on different types of patterns simultaneously (short-term momentum, long-term trends, volatility patterns).

Think of it like having a team of analysts where each analyst can instantly consult with every other analyst about any time period, rather than passing information through a chain.

### Self-Attention Mechanism: The Core Innovation
Self-attention works by creating three different representations of input data:
- **Queries (Q):** "What information am I looking for?"
- **Keys (K):** "What information is available at each time step?"
- **Values (V):** "What information should I extract and combine?"

The attention mechanism computes similarity scores between queries and keys, then uses these scores to create weighted combinations of values. This allows each time step to directly attend to any other time step, regardless of distance.
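
The Q/K/V description above can be sketched in a few lines of NumPy. This is an illustrative single-head version with random projection matrices, not the Keras layer used in `main.py`; the sizes and the helper names `softmax` and `self_attention` are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 4                       # illustrative sizes only
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)   # (5, 5)
print(output.shape)    # (5, 4)
```

Row `i` of `weights` shows how strongly time step `i` draws on every other time step, which is why distance in the sequence does not limit what a step can attend to.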

### Fixed-Length Windows and Batching Strategy
Our implementation uses a sliding window approach:
1. **Window Size (seq_len):** Choose how much history the model sees at once (e.g., 100 time steps).
2. **Sliding Windows:** Create overlapping sequences from the time series data.
3. **Batch Processing:** Process multiple windows in parallel for efficient GPU utilization.
4. **Embedding Space:** Project scalar values into high-dimensional space for rich representations.

This approach maximizes training samples while ensuring consistent input shapes for the attention mechanism.

### Visualization Tools: TensorBoard and Visualkeras
- **TensorBoard:** TensorFlow's visualization toolkit for monitoring training metrics, visualizing model graphs, and examining weight distributions.
- **Visualkeras:** Creates beautiful, layered diagrams of Keras model architectures, helping understand the complex Transformer structure.

## 3. Code Deep Dive: Complete Transformer Implementation

### File Structure Overview
The entire implementation is contained in `main.py`, featuring:
- Custom Transformer layers built from scratch
- Multi-head self-attention mechanism
- Complete training pipeline with monitoring
- Data preprocessing and evaluation utilities

### Environment Setup and GPU Configuration
```python
import os
from datetime import datetime
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Layer, Dense, LayerNormalization, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import Input
import matplotlib.pyplot as plt
import visualkeras

# Configure GPU memory growth
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(f"{len(gpus)} Physical GPUs, {len(logical_gpus)} Logical GPUs configured.")
    except RuntimeError as e:
        print(f"Error setting up GPU memory growth: {e}")
```
**Key Points:**
- GPU memory growth prevents TensorFlow from allocating all GPU memory at startup
- This is crucial for running multiple processes or handling limited memory scenarios
- Memory is allocated dynamically as needed during training
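
TensorFlow also reads the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, which can serve as an alternative to configuring memory growth in code. A minimal sketch, assuming the variable is set before TensorFlow is first imported in the process:

```python
import os

# Set the variable before TensorFlow initializes any GPU devices.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])   # true
```

Setting it in the shell before launching the script has the same effect and leaves the Python code untouched.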

### MultiHeadSelfAttention: The Heart of the Transformer
```python
class MultiHeadSelfAttention(Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.projection_dim = embed_dim // num_heads
        
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
            )

        self.query_dense = Dense(embed_dim)
        self.key_dense = Dense(embed_dim)
        self.value_dense = Dense(embed_dim)
        self.combine_heads = Dense(embed_dim)
```
**Architecture Details:**
- `embed_dim`: Dimensionality of input embeddings (e.g., 128)
- `num_heads`: Number of parallel attention mechanisms (e.g., 8)
- `projection_dim`: Dimension per head (embed_dim ÷ num_heads)
- Three Dense layers create Q, K, V projections
- Final Dense layer combines all heads back together

### Attention Score Computation
```python
def attention(self, query, key, value):
    score = tf.matmul(query, key, transpose_b=True)
    
    dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_score = score / tf.math.sqrt(dim_key)
    
    weights = tf.nn.softmax(scaled_score, axis=-1)
    
    output = tf.matmul(weights, value)
    return output, weights
```
**Mathematical Foundation:**
1. **Similarity Scores:** Compute dot products between queries and keys (Q·K^T)
2. **Scaling:** Divide by √d_k to prevent softmax saturation in high dimensions
3. **Attention Weights:** Apply softmax to get probability distribution
4. **Weighted Values:** Multiply attention weights with values to get output

This creates an n×n attention matrix for each head (n is the window length), where entry (i, j) is the weight that time step i places on time step j.
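
To see why the scaling step matters, the following NumPy sketch compares softmax weights with and without division by √d_k. The dimensions and random inputs are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
d_k, num_keys = 512, 10                    # illustrative sizes
query = rng.normal(size=d_k)
keys = rng.normal(size=(num_keys, d_k))

raw_scores = keys @ query                  # spread grows roughly like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)  # spread stays close to 1

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print(f"largest weight without scaling: {raw_weights.max():.3f}")
print(f"largest weight with scaling:    {scaled_weights.max():.3f}")
```

Without scaling, the softmax tends to put almost all of its mass on a single key, which leaves very small gradients for the other positions.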

### Multi-Head Processing and Head Splitting
```python
def split_heads(self, x, batch_size):
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
    return tf.transpose(x, perm=[0, 2, 1, 3])

def call(self, inputs):
    batch_size = tf.shape(inputs)[0]
    
    query = self.query_dense(inputs)
    key = self.key_dense(inputs)
    value = self.value_dense(inputs)
    
    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size)
    
    attention_output, _ = self.attention(query, key, value)
    
    attention_output = tf.transpose(attention_output, perm=[0, 2, 1, 3])
    concat_attention = tf.reshape(attention_output, (batch_size, -1, self.embed_dim))
    
    return self.combine_heads(concat_attention)
```
**Multi-Head Benefits:**
- Each head focuses on different aspects (short-term trends, long-term patterns, volatility)
- Parallel processing of multiple attention mechanisms
- Richer representation through diverse attention patterns
- Final combination layer learns optimal head integration
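
The reshape and transpose in `split_heads` can be hard to picture. This NumPy sketch mirrors the same shape manipulation, assuming a batch of 2 windows and the default sizes used in this guide, and checks that merging the heads restores the original layout.

```python
import numpy as np

batch_size, seq_len, embed_dim, num_heads = 2, 100, 128, 8
projection_dim = embed_dim // num_heads          # 16 values per head

x = np.arange(batch_size * seq_len * embed_dim, dtype=np.float32)
x = x.reshape(batch_size, seq_len, embed_dim)

# Split: (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, projection_dim)
heads = x.reshape(batch_size, seq_len, num_heads, projection_dim)
heads = heads.transpose(0, 2, 1, 3)
print(heads.shape)     # (2, 8, 100, 16)

# Merge: undo the transpose, then fold the head axis back into embed_dim.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
print(merged.shape)    # (2, 100, 128)
```

Each head therefore sees all 100 time steps but only its own 16-dimensional slice of the 128-dimensional projection.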

### TransformerBlock: Complete Building Block
```python
class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadSelfAttention(embed_dim, num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim),
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=False):
        attn_output = self.att(inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)
```
**Architecture Components:**
- **Multi-Head Attention:** Captures relationships between time steps
- **Feed-Forward Network:** Two-layer MLP for position-wise transformations
- **Residual Connections:** Add input to each sub-layer output (crucial for deep networks)
- **Layer Normalization:** Stabilizes training and accelerates convergence
- **Dropout:** Regularization to prevent overfitting
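
The "add, then normalize" pattern can be illustrated with plain NumPy. This is a simplified sketch: Keras `LayerNormalization` also learns a scale and offset per feature, which are omitted here, and the random arrays merely stand in for real activations.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each time step's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
inputs = rng.normal(loc=5.0, scale=3.0, size=(4, 8))   # (seq_len, embed_dim)
sublayer_output = rng.normal(size=(4, 8))              # stand-in for attention output

out = layer_norm(inputs + sublayer_output)             # residual add, then normalize
print(out.mean(axis=-1).round(6))
print(out.std(axis=-1).round(3))
```

The residual sum keeps a direct path from input to output, and the normalization brings every time step back to a comparable scale before the next sub-layer.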

### TransformerEncoder: Stacking Multiple Blocks
```python
class TransformerEncoder(Layer):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerEncoder, self).__init__()
        self.enc_layers = [TransformerBlock(embed_dim, num_heads, ff_dim, rate)
                           for _ in range(num_layers)]
        self.dropout = Dropout(rate)

    def call(self, inputs, training=False):
        x = inputs
        x = self.dropout(x, training=training)
        
        for layer in self.enc_layers:
            x = layer(x, training=training)
        return x
```
**Hierarchical Learning:**
- Each layer builds upon representations from previous layers
- First layers might learn basic patterns (local trends, seasonal patterns)
- Deeper layers identify complex relationships and long-term dependencies
- Stacking enables increasingly sophisticated feature extraction

### Data Preprocessing: Creating Time Series Sequences
```python
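def create_dataset(data, time_step=1):
    """Split a (num_samples, 1) array into input windows and next-step targets."""
    X, Y = [], []
    # The last usable window starts at len(data) - time_step - 1, so the range
    # stops at len(data) - time_step (exclusive) to keep every available sample.
    for i in range(len(data) - time_step):
        X.append(data[i:(i + time_step), 0])   # time_step consecutive values
        Y.append(data[i + time_step, 0])       # the value that follows the window
    return np.array(X), np.array(Y)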
```
**Sliding Window Approach:**
- Creates overlapping sequences of fixed length from time series
- Example: [1,2,3,4,5,6] with time_step=3 → inputs [1,2,3], [2,3,4], [3,4,5] with targets 4, 5, 6
- Maximizes training examples from single time series
- Mimics real-world forecasting scenario (fixed historical window)
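
The worked example in the list above can be reproduced without an explicit loop. This sketch uses NumPy's `sliding_window_view` (available from NumPy 1.20) as an alternative formulation; it is not the approach taken in `main.py`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.array([1, 2, 3, 4, 5, 6])
time_step = 3

# Take every window of length time_step + 1: the first time_step values form
# the input, and the final value is the target.
windows = sliding_window_view(series, time_step + 1)
X, Y = windows[:, :-1], windows[:, -1]

print(X.tolist())   # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(Y.tolist())   # [4, 5, 6]
```

A series of length `n` yields `n - time_step` input/target pairs under this scheme.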

### Complete Model Architecture
```python
def build_model(time_step, embed_dim=128, num_heads=8, ff_dim=512, 
                num_layers=4, dropout_rate=0.1):
    
    inputs = Input(shape=(time_step, 1))
    
    x = Dense(embed_dim)(inputs)
    
    encoder = TransformerEncoder(num_layers, embed_dim, num_heads, ff_dim, dropout_rate)
    x = encoder(x)
    
    x = tf.keras.layers.Flatten()(x)
    x = Dropout(dropout_rate)(x)
    outputs = Dense(1)(x)
    
    return Model(inputs, outputs)
```
**Pipeline Architecture:**
1. **Input Layer:** Accepts sequences of shape (time_step, 1)
2. **Embedding Layer:** Projects scalars to high-dimensional space
3. **Transformer Encoder:** Processes sequences with self-attention
4. **Flatten Layer:** Concatenates all time step representations
5. **Output Layer:** Final dense layer for price prediction
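
As a sanity check on model size, the trainable parameters of this pipeline can be counted by hand. The sketch below assumes the layer layout shown in this guide (four `Dense` projections per attention layer, a two-layer feed-forward network, and two layer normalizations per block); `dense_params` and `count_params` are hypothetical helper names.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out                          # kernel plus bias

def count_params(time_step=100, embed_dim=128, ff_dim=512, num_layers=4):
    embedding = dense_params(1, embed_dim)               # scalar -> embed_dim
    attention = 4 * dense_params(embed_dim, embed_dim)   # Q, K, V, combine_heads
    ffn = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
    layer_norms = 2 * (2 * embed_dim)                    # scale and offset, twice
    block = attention + ffn + layer_norms
    head = dense_params(time_step * embed_dim, 1)        # Dense(1) after Flatten
    return embedding + num_layers * block + head

print(count_params())   # 806145
```

`num_heads` does not appear in the count because the heads share the same `embed_dim`-wide projections; changing it only changes how those projections are sliced.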

### Training Pipeline with Monitoring
```python
def main():
    # Generate synthetic stock data
    np.random.seed(42)
    data_length = 2000
    trend = np.linspace(100, 200, data_length)
    noise = np.random.normal(0, 2, data_length)
    synthetic_data = trend + noise
    df = pd.DataFrame(synthetic_data, columns=['Close'])
    
    # Data normalization
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_data = scaler.fit_transform(df[['Close']].values)
    
    # Prepare sequences
    time_step = 100
    X, Y = create_dataset(scaled_data, time_step)
    X = X.reshape((X.shape[0], X.shape[1], 1))
    
    # Build and compile model
    model = build_model(time_step, embed_dim=128, num_heads=8, 
                        ff_dim=512, num_layers=4, dropout_rate=0.1)
    model.compile(optimizer='adam', loss='mse', 
                  metrics=['mae', tf.keras.metrics.RootMeanSquaredError(name='rmse')])
```
**Key Design Choices:**
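- **Synthetic Data:** Linear trend + Gaussian noise gives a simple, controllable test series; real prices are far less regular
- **Normalization:** MinMaxScaler maps prices to [0, 1], which keeps inputs and gradients in a stable range. Note that the scaler here is fit on the full series, including the part later used for validation
- **Hyperparameters:** 100 time steps (about 20 trading weeks if each step is one trading day), 128-dim embeddings, 8 heads, 4 layers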
- **Optimization:** Adam optimizer with MSE loss for regression
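
When moving beyond a demo, a common precaution is to compute scaling statistics from the training portion only, so that no information from later data leaks into training. The sketch below shows the idea with plain NumPy instead of `MinMaxScaler`; the 90/10 split and the synthetic series are assumptions in the spirit of this guide.

```python
import numpy as np

rng = np.random.default_rng(42)
series = np.linspace(100, 200, 2000) + rng.normal(0, 2, 2000)

split = int(len(series) * 0.9)            # chronological split, no shuffling
train, held_out = series[:split], series[split:]

# Fit the min-max statistics on the training portion only.
train_min, train_max = train.min(), train.max()

def scale(values):
    return (values - train_min) / (train_max - train_min)

train_scaled, held_out_scaled = scale(train), scale(held_out)
print(train_scaled.min(), train_scaled.max())
print(held_out_scaled.max())   # can exceed 1.0 because the trend keeps rising
```

Held-out values outside [0, 1] are expected with a trending series and reflect what the model would face on genuinely unseen data.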

### TensorBoard and Visualkeras Integration
```python
    # Setup logging
    logdir = os.path.join("logs", datetime.now().strftime("%Y%m%d-%H%M%S"))
    os.makedirs(logdir, exist_ok=True)
    
    # Generate architecture diagram
    arch_path = os.path.join(logdir, 'model_visualkeras.png')
    visualkeras.layered_view(model, to_file=arch_path, legend=True, 
                            draw_volume=False, scale_xy=1.5, scale_z=1, spacing=20)
    
    # TensorBoard callback
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        log_dir=logdir, histogram_freq=1, write_graph=True, 
        write_images=True, update_freq='epoch', profile_batch=1
    )
    
    # Train with monitoring
    history = model.fit(X, Y, epochs=20, batch_size=32, 
                        validation_split=0.1, callbacks=[tensorboard_cb])
```
**Monitoring Features:**
- **Timestamped Logs:** Organized logging structure
- **Architecture Visualization:** Layered model diagram saved as `model_visualkeras.png` in the log directory
- **TensorBoard Integration:** Real-time training metrics, weight histograms, computational graphs
- **Validation Monitoring:** Track overfitting with held-out data

### Model Evaluation and Prediction Visualization
```python
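    # Evaluate model. Note: X and Y are the same windows passed to model.fit, so
    # these numbers measure fit rather than true out-of-sample ("test") accuracy.
    evaluation_results = model.evaluate(X, Y)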
    loss, mae, rmse = evaluation_results
    print(f"Test loss (MSE): {loss:.6f}")
    print(f"Test MAE: {mae:.6f}")
    print(f"Test RMSE: {rmse:.6f}")
    
    # Make predictions and inverse transform
    predictions = model.predict(X)
    predictions = scaler.inverse_transform(predictions)
    
    # Visualization
    plt.figure(figsize=(10, 6))
    plt.plot(df['Close'], label='True Data')
    plt.plot(np.arange(time_step, time_step + len(predictions)), 
             predictions, label='Predictions')
    plt.title('Transformer Time Series Forecasting')
    plt.xlabel('Time')
    plt.ylabel('Stock Price')
    plt.legend()
```
**Evaluation Metrics:**
- **MSE (Mean Squared Error):** Penalizes large errors heavily
- **MAE (Mean Absolute Error):** Average absolute prediction error
- **RMSE (Root Mean Squared Error):** Square root of MSE, same units as target
- **Visual Assessment:** Qualitative evaluation of prediction quality
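
Because the model is trained on min-max scaled values, the reported metrics are in scaled units rather than price units. The sketch below computes the three metrics by hand on made-up numbers and converts RMSE back to price units; the fitted range of 100 to 200 is a hypothetical value.

```python
import numpy as np

y_true = np.array([0.10, 0.20, 0.30, 0.40])   # scaled targets (made-up values)
y_pred = np.array([0.15, 0.20, 0.25, 0.50])   # scaled predictions (made-up values)

errors = y_pred - y_true
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))
rmse = np.sqrt(mse)
print(round(float(mse), 6), round(float(mae), 6), round(float(rmse), 4))

# With a (0, 1) min-max scaling, an error in scaled units converts to price
# units by multiplying by the fitted range (max - min).
price_min, price_max = 100.0, 200.0           # hypothetical fitted range
rmse_price = rmse * (price_max - price_min)
print(round(float(rmse_price), 2))
```

The same multiplication applies to MAE, while MSE scales with the square of the range.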

## 4. Setup and Running the Application

### Prerequisites
- Python 3.8+ (recommended)
- CUDA-compatible GPU (optional but recommended for performance)
- `pip` package manager

### Installation Steps
1. **Clone the repository:**
   ```bash
   git clone <your-repo-url>
   cd transformers/time_series_forecast
   ```

2. **Create virtual environment:**
   ```bash
   python3 -m venv transformer_env
   ```

3. **Activate virtual environment:**
   - Linux/macOS:
     ```bash
     source transformer_env/bin/activate
     ```
   - Windows:
     ```bash
     transformer_env\Scripts\activate
     ```

4. **Install required packages:**
   ```bash
   pip install tensorflow numpy pandas scikit-learn matplotlib visualkeras
   ```

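5. **For GPU support (optional):** The separate `tensorflow-gpu` package is deprecated; the standard `tensorflow` package already includes GPU support. On Linux with recent TensorFlow releases, the matching CUDA libraries can be pulled in through pip (check the TensorFlow install guide for your version):
   ```bash
   pip install "tensorflow[and-cuda]"
   ```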

### Required Dependencies (requirements.txt)
```txt
tensorflow>=2.8.0
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
matplotlib>=3.5.0
visualkeras>=0.0.2
```

### Running the Project
Execute the main script:
```bash
python main.py
```

This will:
- Generate synthetic stock price data (2000 time points)
- Build the Transformer model from scratch
- Train for 20 epochs with validation monitoring
- Create visualizations and save to timestamped log directory
- Evaluate model performance and display metrics
- Generate prediction plots

**Expected Output:**
```
1 Physical GPUs, 1 Logical GPUs configured.
Model: "model"
...
Total params: 806,145
Trainable params: 806,145
TensorBoard logs in: /path/to/logs/20231201-143022
Run: tensorboard --logdir logs/20231201-143022
Epoch 1/20
...
Test loss (MSE): 0.000123
Test MAE: 0.008456
Test RMSE: 0.011089
```

### Monitoring with TensorBoard
Start TensorBoard to monitor training:
```bash
tensorboard --logdir logs
```

Navigate to `http://localhost:6006` to view:
- **Scalars:** Training/validation loss, MAE, RMSE over epochs
- **Graphs:** Complete model computational graph
- **Histograms:** Weight and bias distributions
- **Images:** Layer weights rendered as images (enabled by `write_images=True`)
- **Profiler:** Performance analysis and bottleneck identification

### Customization Options
Modify hyperparameters in the `build_model()` call:
```python
model = build_model(
    time_step=100,        # Historical window size
    embed_dim=128,        # Embedding dimension
    num_heads=8,          # Number of attention heads
    ff_dim=512,           # Feed-forward dimension
    num_layers=4,         # Number of Transformer blocks
    dropout_rate=0.1      # Dropout rate for regularization
)
```

**Hyperparameter Guidelines:**
- Increase `time_step` for longer historical context (but higher memory usage)
- Increase `embed_dim` for richer representations (but more parameters)
- More `num_heads` can capture more diverse attention patterns, but `embed_dim` must remain divisible by `num_heads`
- Deeper `num_layers` for complex pattern learning (but harder training)
- Adjust `dropout_rate` based on overfitting behavior
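
Before launching a long training run, it can help to check that a candidate configuration is valid. This sketch mirrors the divisibility check from `MultiHeadSelfAttention`; `check_config` is a hypothetical helper, not part of `main.py`.

```python
def check_config(embed_dim, num_heads):
    """Return the per-head dimension, or raise if the split is uneven."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        )
    return embed_dim // num_heads

for heads in (4, 8, 16):
    print(heads, check_config(128, heads))   # 4 32, then 8 16, then 16 8

try:
    check_config(128, 6)
except ValueError as err:
    print(err)
```

Note that adding heads at a fixed `embed_dim` shrinks the dimension each head works with, so more heads is not automatically better.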

## 5. Advanced Topics and Extensions

### Positional Encoding (Not Implemented Here)
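Self-attention on its own is order-agnostic, so standard Transformers add positional encodings that tell the model where each time step sits in the sequence. A sinusoidal encoding for time series could look like this:
```python
def get_angles(pos, i, d_model):
    # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                           np.arange(d_model)[np.newaxis, :],
                           d_model)
    # Apply sin to even indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    # Apply cos to odd indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return tf.cast(angle_rads, dtype=tf.float32)
```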

### Causal (Masked) Attention for Autoregressive Forecasting
For truly autoregressive prediction, implement causal masking:
```python
def create_causal_mask(seq_len):
    # Lower-triangular matrix: 1 where attention is allowed (current and past steps)
    mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    return mask

# In attention computation:
if mask is not None:
    # Push scores at disallowed (future) positions toward -inf before the softmax
    scaled_score += (1.0 - mask) * -1e9
```
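
The effect of a causal mask is easiest to see on a tiny example. This NumPy sketch assumes uniform raw scores and a mask that holds 1 where attention is allowed, then shows that the softmax assigns zero weight to future positions.

```python
import numpy as np

seq_len = 4
allowed = np.tril(np.ones((seq_len, seq_len)))    # 1 where attention is allowed

scores = np.zeros((seq_len, seq_len))             # uniform scores for clarity
masked_scores = scores + (1.0 - allowed) * -1e9   # penalize future positions only

exp = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

The first row attends only to itself, the second row splits its weight evenly over the first two steps, and so on down to the last row, which can see the whole window.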

### Multi-Variate Time Series Extension
Extend to multiple features (price, volume, indicators):
```python
# Input shape becomes (batch_size, seq_len, num_features)
inputs = Input(shape=(time_step, num_features))  # e.g., num_features=5
x = Dense(embed_dim)(inputs)  # Projects each feature vector to embed_dim
```
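
The windowing step needs a matching change so that each input window keeps all features while the target stays a single column. A minimal sketch, assuming the target is column 0; `create_multivariate_dataset` is a hypothetical helper that uses every available window.

```python
import numpy as np

def create_multivariate_dataset(data, time_step, target_col=0):
    """data: (num_rows, num_features) -> X (n, time_step, num_features), Y (n,)."""
    X, Y = [], []
    for i in range(len(data) - time_step):
        X.append(data[i:i + time_step, :])          # all features in the window
        Y.append(data[i + time_step, target_col])   # next value of the target column
    return np.array(X), np.array(Y)

data = np.arange(20, dtype=float).reshape(10, 2)    # 10 rows, 2 made-up features
X, Y = create_multivariate_dataset(data, time_step=3)
print(X.shape, Y.shape)   # (7, 3, 2) (7,)
print(Y[:3])              # [ 6.  8. 10.]
```

With this layout, `X` can be fed to the model directly, with no extra reshape to add a feature axis.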

### Real-World Data Integration
Replace synthetic data with real financial data:
```python
import yfinance as yf  # extra dependency, not in requirements.txt: pip install yfinance

# Download real stock data
ticker = "AAPL"
data = yf.download(ticker, start="2020-01-01", end="2023-12-31")
prices = data['Close'].values.reshape(-1, 1)
```

## 6. Performance Optimization Tips

### Memory Management
- **Gradient Checkpointing:** For very deep models, trade computation for memory
- **Mixed Precision Training:** Run most computations in float16 while keeping variables in float32 (with loss scaling) for numerical stability
- **Sequence Length vs. Memory:** Quadratic scaling O(seq_len²) in attention
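
The quadratic term can be made concrete with a rough estimate of the memory held by attention weights alone. This sketch assumes float32 values and counts one attention matrix per head and per sample; it ignores all other activations, and `attention_matrix_bytes` is a hypothetical helper.

```python
def attention_matrix_bytes(batch_size, num_heads, seq_len, bytes_per_value=4):
    # One (seq_len x seq_len) weight matrix per head and per sample.
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

for seq_len in (100, 200, 400):
    mib = attention_matrix_bytes(32, 8, seq_len) / 2**20
    print(f"seq_len={seq_len}: {mib:.1f} MiB per layer")
```

Doubling the window length quadruples this figure, which is why long windows tend to hit memory limits before wide embeddings do.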

### Training Acceleration
- **Learning Rate Scheduling:** Warm-up and decay strategies
- **Batch Size Optimization:** Larger batches for GPU efficiency
- **Multi-GPU Training:** Distribute across multiple devices
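
One well-known warm-up and decay strategy is the schedule from the original Transformer paper, which raises the learning rate linearly for a fixed number of steps and then decays it with the inverse square root of the step. A sketch in plain Python; the `warmup_steps=200` default is an assumption sized for a short training run like this one, not a value taken from `main.py`.

```python
def warmup_schedule(step, d_model=128, warmup_steps=200):
    """Learning rate at a given optimizer step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 100, 200, 800):
    print(step, f"{warmup_schedule(step):.6f}")
```

The rate peaks at `step == warmup_steps`; wrapping the same formula in a Keras learning rate schedule would let the optimizer use it during training.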

## 7. Troubleshooting Common Issues

### Training Issues
- **Loss not decreasing:** Check learning rate, data normalization, model capacity
- **Exploding gradients:** Reduce learning rate, add gradient clipping
- **Overfitting:** Increase dropout, reduce model size, add regularization
- **GPU out of memory:** Reduce batch size or sequence length

### Model Performance
- **Poor predictions:** Increase model capacity, check data quality, extend training
- **Slow convergence:** Adjust learning rate schedule, check normalization
- **Unstable training:** Use gradient clipping, reduce learning rate
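
Gradient clipping, mentioned several times above, rescales gradients whenever their combined norm exceeds a threshold. The NumPy sketch below illustrates the computation on made-up gradients; in practice Keras optimizers expose clipping arguments such as `clipnorm` and `global_clipnorm`, so a hand-written version like this is rarely needed.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]    # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, round(float(norm_after), 6))     # 13.0 1.0
```

Clipping preserves the direction of the update and only limits its size, so it mainly affects the occasional very large step.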

## 8. Conclusion
This guide has walked through implementing a Transformer encoder for time series forecasting. We've covered:

- **Theoretical Foundation:** Self-attention mechanism and multi-head attention
- **Complete Implementation:** From-scratch Transformer blocks in Keras
- **Training Pipeline:** Data preprocessing, model compilation, and monitoring
- **Practical Applications:** Stock price prediction on synthetic data, with pointers toward real-world use

**Key Advantages of Transformers for Time Series:**
- Processing all time steps in parallel can make training faster than with RNNs
- Direct long-range dependency modeling
- Multiple attention heads capture diverse patterns
- Scalable architecture for complex temporal relationships

**Next Steps:**
- Experiment with different hyperparameters
- Try real financial datasets
- Implement positional encodings and causal masking
- Extend to multivariate forecasting
- Deploy for real-time prediction systems

This implementation provides a solid foundation for advanced time series forecasting applications across finance, demand planning, sensor analytics, and beyond. The Transformer's flexibility makes it adaptable to many sequential prediction problems where capturing long-term dependencies is crucial.