# ROCKET Transformer: Random Convolutional Kernel Transform

**ROCKET** (RandOm Convolutional KErnel Transform) is a fast and accurate time series classification method that transforms time series into features using thousands of random convolutional kernels.

## Key Intuition

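Instead of learning kernels (as CNNs do), ROCKET generates **random kernels** with random:
- **Weights**: sampled from {-1, 0, 1} in this notebook's simplified implementation (the original paper draws them from $\mathcal{N}(0, 1)$ and mean-centers them)
- **Lengths** ∈ {7, 9, 11}
- **Biases**: drawn from $\mathcal{U}(-1, 1)$ in the original paper; the from-scratch code in this notebook instead sets them from a quantile of the convolution output
- **Dilations**: sampled on an exponential scale for multi-scale pattern detection
- **Padding**: valid or same, chosen with equal probability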

Each kernel extracts **two features**:
1. **PPV (Proportion of Positive Values)**: Captures how often the pattern appears
2. **Global Max Pooling**: Captures the strongest activation

With 10,000 kernels → 20,000 features, combined with a simple linear classifier (Ridge), this reaches accuracy competitive with state-of-the-art methods on the UCR archive at a fraction of the training time.

---

## 1. Mathematical Foundation

### Kernel Definition

A ROCKET kernel is defined as a tuple:

$$k = (w, l, b, d, p)$$

Where:
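- $w \in \{-1, 0, 1\}^l$: **kernel weights** (randomly sampled; the original paper uses $w_j \sim \mathcal{N}(0, 1)$, mean-centered)
- $l \in \{7, 9, 11\}$: **kernel length**
- $b \in \mathbb{R}$: **bias** (set from a quantile of the convolution output in this notebook; the original paper samples $b \sim \mathcal{U}(-1, 1)$)
- $d \in \mathbb{Z}^+$: **dilation factor** (sampled on an exponential scale, $d = \lfloor 2^x \rfloor$ with $x$ uniform)
- $p \in \{0, \lfloor(l-1) \cdot d / 2\rfloor\}$: **padding** (valid or same)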
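
As a quick check on the two padding options, the sketch below computes how many output positions a kernel produces. The helper name `conv_output_length` is hypothetical (it is not part of sktime or this notebook), and it assumes $p$ zeros are padded on each side, which gives $T + 2p - (l-1) \cdot d$ outputs.

```python
def conv_output_length(T, l, d, p):
    """Output positions for a length-T series, kernel length l,
    dilation d, and p zeros padded on each side (hypothetical helper)."""
    receptive_field = (l - 1) * d + 1
    return max(T + 2 * p - receptive_field + 1, 0)

T, l, d = 100, 9, 3
same_padding = (l - 1) * d // 2  # floor((l - 1) * d / 2) = 12

valid_len = conv_output_length(T, l, d, p=0)
same_len = conv_output_length(T, l, d, p=same_padding)
print(valid_len, same_len)  # 76 100
```

Because $l$ is odd, $(l-1) \cdot d$ is always even, so "same" padding preserves the series length exactly.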

### Dilated Convolution Operation

For an input time series $x$ of length $T$, the dilated convolution output at position $i$ is:

$$z_i = \sum_{j=0}^{l-1} w_j \cdot x_{i + j \cdot d} + b$$

The **dilation factor** $d$ determines the spacing between kernel elements:
- $d = 1$: Standard convolution (consecutive elements)
- $d = 2$: Skip every other element (receptive field = $2l - 1$)
- $d = k$: Receptive field spans $(l-1) \cdot k + 1$ time steps
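
The formula can be transcribed almost literally into NumPy. The following is a minimal sketch on a toy series; the values of `x`, `w`, `b`, and `d` are made up for illustration and no padding is applied.

```python
import numpy as np

x = np.arange(10, dtype=float)        # toy series: 0, 1, ..., 9
w = np.array([1.0, 0.0, -1.0])        # toy kernel of length l = 3
b, d = 0.5, 2                         # bias and dilation

l = len(w)
receptive_field = (l - 1) * d + 1     # 5 time steps
n_out = len(x) - receptive_field + 1  # 6 valid output positions

# z_i = sum_j w_j * x[i + j*d] + b
z = np.array([np.dot(w, x[i + np.arange(l) * d]) + b for i in range(n_out)])
print(z)
```

Here each output equals $x_i - x_{i+4} + 0.5 = -3.5$, since the series rises by one per step and the kernel compares points four steps apart.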

### Feature Extraction

From each convolution output $z = (z_1, z_2, \ldots, z_n)$, we extract **two features**:

#### Proportion of Positive Values (PPV)

$$\text{PPV}(z) = \frac{1}{|z|}\sum_{i=1}^{|z|} \mathbf{1}_{z_i > 0}$$

**Intuition**: PPV measures *how frequently* a pattern appears in the series. A high PPV means the pattern (after bias adjustment) is common.

#### Global Max Pooling (GMP)

$$\text{GMP}(z) = \max_{i=1}^{|z|} z_i$$

**Intuition**: GMP captures the *strongest match* to the pattern anywhere in the series.
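
To see how the bias interacts with both features, the sketch below applies the two formulas to a made-up convolution output (`raw` is illustrative data, not taken from any dataset) with two different bias values.

```python
import numpy as np

# Made-up convolution output before the bias is added
raw = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0])

def ppv_and_gmp(z):
    """PPV = fraction of strictly positive entries, GMP = maximum entry."""
    return np.mean(z > 0), np.max(z)

ppv_zero, gmp_zero = ppv_and_gmp(raw + 0.0)        # bias b = 0
ppv_shifted, gmp_shifted = ppv_and_gmp(raw - 1.5)  # bias b = -1.5

print(ppv_zero, gmp_zero)        # 0.5 3.0
print(ppv_shifted, gmp_shifted)  # 0.3 1.5
```

A more negative bias raises the threshold a match must clear, so PPV drops from 0.5 to 0.3, while the maximum simply shifts by the bias.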

### Total Feature Count

With $n_{\text{kernels}}$ random kernels, the total feature dimension is:

$$\text{Features} = 2 \times n_{\text{kernels}}$$

For the default 10,000 kernels → **20,000 features**.

---

## 2. Setup and Data Loading

In [None]:

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sktime.datasets import load_basic_motions, load_unit_test

np.random.seed(42)

In [None]:
X_train, y_train = load_basic_motions(split="train", return_X_y=True)
X_test, y_test = load_basic_motions(split="test", return_X_y=True)



---

## 9. Using sktime's ROCKET (Production Ready)

For production use, always prefer the optimized sktime implementation:

In [None]:
from sklearn.linear_model import RidgeClassifierCV
from sklearn.model_selection import cross_val_score

# Standardize features
features_scaled = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

# Cross-validation
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
scores = cross_val_score(clf, features_scaled, y_synthetic, cv=5)

print("🎯 Classification Results (5-fold CV):")
print(f"   Accuracy: {scores.mean():.2%} ± {scores.std():.2%}")

# Visualize the transformed feature space
pca_custom = PCA(n_components=2)
features_pca = pca_custom.fit_transform(features_scaled)

fig = px.scatter(
    x=features_pca[:, 0], y=features_pca[:, 1],
    color=['Sine' if y == 0 else 'Sawtooth' for y in y_synthetic],
    title="<b>Custom ROCKET Implementation: Feature Space</b><br><sup>Synthetic data: Sine waves vs Sawtooth waves</sup>",
    labels={'x': 'PC1', 'y': 'PC2', 'color': 'Class'},
    template="plotly_white"
)
fig.update_traces(marker=dict(size=12, opacity=0.8))
fig

## 8. Validate Our Implementation

Let's verify that our low-level implementation produces reasonable results by training a classifier on our synthetic data.

In [None]:
def rocket_transform(X, n_kernels, random_state=42):
    """
    Full ROCKET transformation: time series → feature vectors.
    
    Parameters
    ----------
    X : np.array
        Time series data of shape (n_samples, series_length)
    n_kernels : int
        Number of random kernels to generate
    random_state : int
        Random seed for reproducibility
        
    Returns
    -------
    features : np.array
        Feature matrix of shape (n_samples, 2 * n_kernels);
        each kernel produces 2 features (PPV and max)
    kernels : list of dict
        Parameters of the generated kernels
    
    Algorithm
    ---------
    1. Generate n_kernels random kernels
    2. For each kernel:
       a. Compute convolution output for all series
       b. Set bias based on quantile of outputs
       c. Extract PPV and max features
    3. Return concatenated features
    """
    np.random.seed(random_state)
    
    n_samples, series_length = X.shape
    
    # Output: 2 features per kernel (PPV and max)
    features = np.zeros((n_samples, 2 * n_kernels))
    
    # Store kernels for reference
    kernels = []
    
    for k in range(n_kernels):
        # Generate random kernel
        kernel_params = generate_rocket_kernel(series_length, random_state=None)
        kernels.append(kernel_params)
        
        weights = kernel_params['weights']
        dilation = kernel_params['dilation']
        padding = kernel_params['padding']
        
        # Compute convolution for all samples to determine bias
        all_conv_outputs = []
        for i in range(n_samples):
            conv_out = dilated_convolution(X[i], weights, dilation, padding)
            all_conv_outputs.append(conv_out)
        
        # Set bias from a random quantile of the pooled convolution outputs,
        # which spreads the PPV values across kernels
        all_values = np.concatenate(all_conv_outputs)
        if len(all_values) > 0:
            bias = -np.quantile(all_values, np.random.uniform(0, 1))
        else:
            bias = 0.0
        kernel_params['bias'] = bias  # Replace the placeholder with the bias actually used
        
        # Extract features for each sample
        for i in range(n_samples):
            ppv, gmp = compute_ppv_and_max(all_conv_outputs[i], bias)
            features[i, 2 * k] = ppv
            features[i, 2 * k + 1] = gmp
    
    return features, kernels

# Test on a small dataset
print("🚀 Testing ROCKET Implementation...")
print("-" * 50)

# Create simple synthetic data
np.random.seed(0)
n_samples = 20
series_length = 100

# Two classes: sine waves vs sawtooth waves
X_sine = np.sin(np.linspace(0, 4*np.pi, series_length)) + 0.1 * np.random.randn(n_samples//2, series_length)
X_saw = np.tile(np.linspace(-1, 1, series_length//4), 4) + 0.1 * np.random.randn(n_samples//2, series_length)
X_synthetic = np.vstack([X_sine, X_saw])
y_synthetic = np.array([0]*(n_samples//2) + [1]*(n_samples//2))

# Apply ROCKET
features, kernels = rocket_transform(X_synthetic, n_kernels=100, random_state=42)

print(f"Input shape: {X_synthetic.shape}")
print(f"Output shape: {features.shape}")
print(f"Number of kernels: {len(kernels)}")
print(f"\nFirst 5 features of sample 0: {features[0, :5]}")

In [None]:
def compute_ppv_and_max(conv_output, bias=0.0):
    """
    Compute the two ROCKET features from convolution output.
    
    Parameters
    ----------
    conv_output : np.array
        Output from dilated convolution
    bias : float
        Bias term added to the convolution output before computing PPV
        
    Returns
    -------
    tuple : (ppv, global_max)
        - ppv: Proportion of Positive Values after bias
        - global_max: Maximum value in convolution output
    
    Mathematical Formulas
    ---------------------
    PPV = (1/n) * sum(1_{z_i > 0})  where z_i = conv_output[i] + bias
    GMP = max(conv_output)
    
    Intuition
    ---------
    - PPV: Measures how frequently the pattern appears (high = common pattern)
    - GMP: Captures the strongest match to the pattern anywhere in the series
    """
    if len(conv_output) == 0:
        return 0.0, 0.0
    
    # Add bias to convolution output
    z = conv_output + bias
    
    # PPV: Proportion of values > 0
    ppv = np.mean(z > 0)
    
    # Global Max Pooling
    # Note: taken before adding the bias, so it differs from max(z) in Section 1 by the constant `bias`
    global_max = np.max(conv_output)
    
    return ppv, global_max

# Example
example_conv = np.array([-2, -1, 0, 1, 2, 3, 2, 1, 0, -1])
ppv, gmp = compute_ppv_and_max(example_conv, bias=0)
print(f"Example convolution output: {example_conv}")
print(f"PPV (bias=0): {ppv:.2f} ({int(ppv * len(example_conv))}/{len(example_conv)} positive)")
print(f"Global Max: {gmp}")

In [None]:
def dilated_convolution(x, kernel, dilation, padding):
    """
    Perform 1D dilated convolution with optional padding.
    
    Parameters
    ----------
    x : np.array
        Input time series of shape (T,)
    kernel : np.array
        Convolution kernel of shape (L,)
    dilation : int
        Dilation factor (spacing between kernel elements)
    padding : int
        Number of zeros to pad on each side
        
    Returns
    -------
    np.array : Convolution output
    
    Mathematical Formula
    --------------------
    z_i = sum_{j=0}^{L-1} kernel[j] * x[i + j * dilation]
    
    where:
    - L is kernel length
    - dilation determines spacing between sampled input elements
    - Receptive field = (L - 1) * dilation + 1
    """
    L = len(kernel)
    receptive_field = (L - 1) * dilation + 1
    
    # Apply padding
    if padding > 0:
        x_padded = np.pad(x, (padding, padding), mode='constant', constant_values=0)
    else:
        x_padded = x
    
    # Calculate output length
    output_length = len(x_padded) - receptive_field + 1
    
    if output_length <= 0:
        return np.array([])
    
    # Compute dilated convolution
    output = np.zeros(output_length)
    
    for i in range(output_length):
        # Gather input samples with dilation spacing
        indices = i + np.arange(L) * dilation
        output[i] = np.dot(kernel, x_padded[indices])
    
    return output

# Test the dilated convolution
test_x = np.sin(np.linspace(0, 4 * np.pi, 50))
test_kernel = np.array([1, -2, 1])  # Second derivative approximation

for d in [1, 2, 4]:
    out = dilated_convolution(test_x, test_kernel, dilation=d, padding=0)
    print(f"Dilation={d}: Output length={len(out)}, Receptive field={(len(test_kernel)-1)*d + 1}")

In [None]:
def generate_rocket_kernel(input_length, random_state=None):
    """
    Generate a single random ROCKET kernel with all parameters.
    
    Parameters
    ----------
    input_length : int
        Length of the input time series
    random_state : int, optional
        Random seed for reproducibility
        
    Returns
    -------
    dict : Kernel parameters
        - weights: np.array of shape (length,) with values in {-1, 0, 1}
        - length: int, kernel length from {7, 9, 11}
        - bias: float, bias term (set later based on convolution output)
        - dilation: int, dilation factor
        - padding: int, padding amount
    """
    if random_state is not None:
        np.random.seed(random_state)
    
    # 1. Sample kernel length from {7, 9, 11}
    length = np.random.choice([7, 9, 11])
    
    # 2. Sample weights from {-1, 0, 1} with equal probability
    weights = np.random.choice([-1, 0, 1], size=length).astype(np.float64)
    
    # Ensure at least one non-zero weight
    if np.all(weights == 0):
        weights[np.random.randint(length)] = np.random.choice([-1, 1])
    
    # 3. Calculate maximum dilation to fit within input length
    # Receptive field = (length - 1) * dilation + 1 <= input_length
    max_dilation = (input_length - 1) // (length - 1)
    
    # 4. Sample dilation on an exponential (log-uniform) scale, biased toward small values
    # This ensures we get both fine-grained and coarse patterns
    dilation = int(2 ** np.random.uniform(0, np.log2(max_dilation + 1)))
    dilation = max(1, min(dilation, max_dilation))  # Clamp to valid range
    
    # 5. Determine padding: either 'valid' (0) or 'same' ((length-1)*dilation // 2)
    if np.random.random() < 0.5:
        padding = 0  # Valid convolution
    else:
        padding = (length - 1) * dilation // 2  # Same-ish convolution
    
    # 6. Bias will be set later based on convolution output quantiles
    bias = 0.0  # Placeholder
    
    return {
        'weights': weights,
        'length': length,
        'bias': bias,
        'dilation': dilation,
        'padding': padding
    }

# Example: Generate and display a kernel
example_kernel = generate_rocket_kernel(100, random_state=42)
print("🔧 Example ROCKET Kernel:")
for key, value in example_kernel.items():
    print(f"   {key}: {value}")

---

## 7. Low-Level NumPy Implementation

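Let's implement ROCKET from scratch to understand every component. The implementation is a simplified sketch of the original paper's algorithm: it uses ternary weights and quantile-based biases rather than the paper's Gaussian weights and uniformly sampled biases.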
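
The from-scratch `dilated_convolution` function in this notebook loops over output positions in Python, which is easy to follow but slow. As one possible alternative, the sketch below vectorizes the same computation with NumPy's `sliding_window_view` (available from NumPy 1.20). The function name `dilated_convolution_vectorized` is hypothetical and is not used elsewhere in the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dilated_convolution_vectorized(x, weights, dilation, padding=0):
    """Vectorized sketch of a zero-padded 1D dilated convolution (no bias)."""
    x = np.pad(np.asarray(x, dtype=float), padding)  # zero-pad both sides
    receptive_field = (len(weights) - 1) * dilation + 1
    if len(x) < receptive_field:
        return np.array([])
    # One row per output position; keep every `dilation`-th column of each window
    windows = sliding_window_view(x, receptive_field)[:, ::dilation]
    return windows @ np.asarray(weights, dtype=float)

x_demo = np.sin(np.linspace(0, 4 * np.pi, 50))
out_demo = dilated_convolution_vectorized(x_demo, [1, -2, 1], dilation=4)
print(out_demo.shape)  # (42,)
```

The window view does not copy the series, so memory stays proportional to the input; only the final matrix-vector product allocates the output.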

In [None]:
# 3D visualization
fig_3d = px.scatter_3d(
    x=X_pca[:, 0], y=X_pca[:, 1], z=X_pca[:, 2],
    color=y_all,
    labels={'x': 'PC1', 'y': 'PC2', 'z': 'PC3', 'color': 'Class'},
    title="<b>3D ROCKET Feature Space</b><br><sup>Interactive: Rotate to explore class separability</sup>",
    template="plotly_white"
)
fig_3d.update_traces(marker=dict(size=5, opacity=0.8))
fig_3d.update_layout(height=500)
fig_3d.show()

In [None]:
from sktime.transformations.panel.rocket import Rocket
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Transform time series using ROCKET
rocket = Rocket(num_kernels=1000, random_state=42)  # Fewer kernels for speed
X_train_transformed = rocket.fit_transform(X_train)
X_test_transformed = rocket.transform(X_test)

# Combine for visualization
X_all = np.vstack([X_train_transformed, X_test_transformed])
y_all = np.concatenate([y_train, y_test])
split_labels = ['Train'] * len(y_train) + ['Test'] * len(y_test)

# Standardize and apply PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Create 2D visualization
fig = px.scatter(
    x=X_pca[:, 0], y=X_pca[:, 1],
    color=y_all,
    symbol=split_labels,
    labels={'x': f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)',
            'y': f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)',
            'color': 'Class', 'symbol': 'Split'},
    title="<b>ROCKET Feature Space (PCA Projection)</b><br><sup>Each point is a time series transformed by 1000 random kernels</sup>",
    template="plotly_white"
)
fig.update_traces(marker=dict(size=10, opacity=0.7))
fig.show()

print("📊 PCA Explained Variance:")
print(f"   PC1: {pca.explained_variance_ratio_[0]:.2%}")
print(f"   PC2: {pca.explained_variance_ratio_[1]:.2%}")
print(f"   PC3: {pca.explained_variance_ratio_[2]:.2%}")
print(f"   Total (3 PCs): {sum(pca.explained_variance_ratio_[:3]):.2%}")

## 6. Feature Space Visualization (PCA)

After ROCKET transforms each time series into a feature vector (2,000 features for the 1,000 kernels used here), we can visualize the resulting feature space using dimensionality reduction. If ROCKET works well, different classes should form separable clusters.

In [None]:
def dilated_convolution_1d(x, weights, dilation, padding=0):
    """
    Perform 1D dilated convolution.
    
    Dilated convolution samples input with spacing = dilation.
    Receptive field = (len(weights) - 1) * dilation + 1
    """
    l = len(weights)
    receptive_field = (l - 1) * dilation + 1
    
    # Add padding
    if padding > 0:
        x_padded = np.pad(x, (padding, padding), mode='constant', constant_values=0)
    else:
        x_padded = x
    
    # Output length (empty if the receptive field exceeds the padded series)
    out_len = max(len(x_padded) - receptive_field + 1, 0)
    output = np.zeros(out_len)
    
    for i in range(out_len):
        # Sample with dilation
        indices = i + np.arange(l) * dilation
        output[i] = np.sum(weights * x_padded[indices.astype(int)])
    
    return output

# Demonstrate dilation effect
kernel = np.array([1, -1, 1, -1, 1])  # Edge detector
dilations = [1, 2, 4, 8]

fig = make_subplots(
    rows=len(dilations) + 1, cols=1,
    subplot_titles=["Original Time Series"] + [f"Dilation = {d} (Receptive Field = {(len(kernel)-1)*d + 1})" for d in dilations],
    vertical_spacing=0.06
)

# Original series
fig.add_trace(
    go.Scatter(x=time_axis, y=sample_series, mode='lines', line=dict(color='#636EFA')),
    row=1, col=1
)

# Convolution outputs with different dilations
colors = ['#00CC96', '#EF553B', '#AB63FA', '#FFA15A']
for idx, d in enumerate(dilations):
    conv_out = dilated_convolution_1d(sample_series, kernel, dilation=d)
    fig.add_trace(
        go.Scatter(x=np.arange(len(conv_out)), y=conv_out, mode='lines', 
                   line=dict(color=colors[idx]), name=f'd={d}'),
        row=idx + 2, col=1
    )

fig.update_layout(
    title="<b>Effect of Dilation on Convolution</b><br><sup>Higher dilation → larger receptive field → captures longer-range patterns</sup>",
    height=700,
    template="plotly_white",
    showlegend=False
)
fig.show()

# Visualize receptive field
print("\n🔍 Receptive Field Visualization:")
print(f"   Kernel length: {len(kernel)}")
for d in dilations:
    rf = (len(kernel) - 1) * d + 1
    pattern = "".join(["●" if i % d == 0 and i // d < len(kernel) else "○" for i in range(rf)])
    print(f"   Dilation={d}: {pattern} (spans {rf} time steps)")

## 5. Effect of Dilation on Receptive Field

Dilation is a key ingredient of ROCKET. It allows small kernels to capture patterns at multiple time scales without increasing the number of parameters.

**Intuition**: 
- **Dilation = 1**: Kernel looks at consecutive time steps (fine-grained patterns)
- **Dilation = 4**: Kernel samples every 4th time step, skipping 3 in between (captures slower, larger-scale patterns)
- **Receptive field** = $(l-1) \times d + 1$ time steps
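
One way to build intuition for dilation is that a kernel with dilation $d$ behaves like an ordinary kernel with $d - 1$ zeros inserted between consecutive weights. The sketch below checks this numerically; `dilate_kernel` is a hypothetical helper and the input series is random noise chosen only for the comparison.

```python
import numpy as np

def dilate_kernel(w, d):
    """Insert d - 1 zeros between consecutive weights (hypothetical helper)."""
    stuffed = np.zeros((len(w) - 1) * d + 1)
    stuffed[::d] = w
    return stuffed

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
w = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
d = 4

# Explicit dilated sum: z_i = sum_j w_j * x[i + j*d]
n_out = len(x) - ((len(w) - 1) * d + 1) + 1
direct = np.array([np.dot(w, x[i + np.arange(len(w)) * d]) for i in range(n_out)])

# Ordinary sliding dot product with the zero-stuffed kernel
stuffed_out = np.correlate(x, dilate_kernel(w, d), mode="valid")

print(len(dilate_kernel(w, d)), np.allclose(direct, stuffed_out))  # 17 True
```

The stuffed kernel spans 17 time steps but still has only 5 non-zero weights, which is why dilation widens the receptive field without adding parameters.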

In [None]:
# Get a sample time series
sample_series = X_train.iloc[0, 0].values  # First series, first dimension
time_axis = np.arange(len(sample_series))

# Generate a sample kernel
kernel_weights = np.array([1, 0, -1, 0, 1, -1, 1])
bias = 0

# Perform convolution (valid mode, dilation=1)
conv_output = np.convolve(sample_series, kernel_weights[::-1], mode='valid') + bias

# Calculate PPV and Max
ppv = np.mean(conv_output > 0)
gmp = np.max(conv_output)

# Create visualization
fig = make_subplots(
    rows=3, cols=1,
    subplot_titles=["Original Time Series", "Kernel Weights", "Convolution Output"],
    vertical_spacing=0.12
)

# Original series
fig.add_trace(
    go.Scatter(x=time_axis, y=sample_series, mode='lines', name='Original', line=dict(color='#636EFA')),
    row=1, col=1
)

# Kernel
kernel_colors = ['#EF553B' if w < 0 else '#00CC96' if w > 0 else '#AB63FA' for w in kernel_weights]
fig.add_trace(
    go.Bar(x=list(range(len(kernel_weights))), y=kernel_weights, marker_color=kernel_colors, name='Kernel'),
    row=2, col=1
)

# Convolution output with threshold line
conv_time = np.arange(len(conv_output))
fig.add_trace(
    go.Scatter(x=conv_time, y=conv_output, mode='lines', name='Conv Output', line=dict(color='#00CC96')),
    row=3, col=1
)
fig.add_hline(y=0, line_dash="dash", line_color="red", row=3, col=1)

# Add annotation for features
fig.add_annotation(
    x=0.95, y=0.15, xref="paper", yref="paper",
    text=f"<b>PPV = {ppv:.3f}</b><br>GMP = {gmp:.3f}",
    showarrow=False, font=dict(size=14),
    bgcolor="white", bordercolor="black", borderwidth=1
)

fig.update_layout(
    title="<b>How ROCKET Transforms a Time Series</b><br><sup>Kernel slides across series → convolution output → extract PPV & Max features</sup>",
    height=600,
    template="plotly_white",
    showlegend=False
)
fig.show()

print("📊 Feature Extraction Results:")
print(f"   PPV (Proportion of Positive Values): {ppv:.4f}")
print(f"   GMP (Global Max Pooling): {gmp:.4f}")

## 4. Convolution Output Visualization

Let's see how a kernel transforms a time series. The convolution output shows where the pattern appears (high values) and where it doesn't (low values).
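
One detail worth checking when computing this output with NumPy: `np.convolve` flips its second argument, so the kernel has to be reversed to obtain the sliding dot product used by ROCKET, whereas `np.correlate` applies the kernel as written. The toy values below are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0])  # toy series
w = np.array([1.0, 0.0, -1.0])                # toy kernel

# Reference: slide the kernel and take a dot product at each position
sliding = np.array([np.dot(w, x[i:i + len(w)]) for i in range(len(x) - len(w) + 1)])

via_convolve = np.convolve(x, w[::-1], mode="valid")  # kernel reversed
via_correlate = np.correlate(x, w, mode="valid")      # kernel as written

print(sliding)  # [-3. -1. -2. -2.]
```

All three arrays agree, so either call works as long as the kernel orientation is handled consistently.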

In [None]:
# Generate and visualize random kernels
def generate_random_kernel_weights(length):
    """Generate random kernel weights from {-1, 0, 1}"""
    weights = np.random.choice([-1, 0, 1], size=length)
    # Ensure at least one non-zero weight
    if np.all(weights == 0):
        weights[np.random.randint(length)] = np.random.choice([-1, 1])
    return weights

# Generate 8 different kernels with varying lengths
kernel_lengths = [7, 9, 11, 7, 9, 11, 7, 9]
kernels = [generate_random_kernel_weights(l) for l in kernel_lengths]

# Create visualization
fig = make_subplots(rows=2, cols=4, subplot_titles=[f"Kernel {i+1} (len={kernel_lengths[i]})" for i in range(8)])

for idx, kernel in enumerate(kernels):
    row = idx // 4 + 1
    col = idx % 4 + 1
    
    # Create bar chart for kernel weights
    colors = ['#EF553B' if w < 0 else '#00CC96' if w > 0 else '#636EFA' for w in kernel]
    
    fig.add_trace(
        go.Bar(x=list(range(len(kernel))), y=kernel, marker_color=colors, showlegend=False),
        row=row, col=col
    )
    fig.update_yaxes(range=[-1.5, 1.5], row=row, col=col)

fig.update_layout(
    title="<b>Random ROCKET Kernel Weights</b><br><sup>Red = -1, Green = +1, Blue = 0</sup>",
    height=400,
    template="plotly_white"
)
fig

## 3. Visualizing Random Kernels

Let's visualize what random ROCKET kernels look like. Each kernel has weights drawn from {-1, 0, 1} and varying lengths.

**Intuition**: These random patterns act like "detectors" for different shapes in the time series. Some detect edges (alternating +1/-1), some detect level shifts, and some are more complex.
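
To illustrate the "detector" idea, the sketch below applies a hand-picked ternary kernel (chosen for illustration, not sampled randomly) to a flat series and to a series containing a level shift. The series and the helper `response` are assumptions made for this example.

```python
import numpy as np

# Negative weights on the left, positive on the right: responds to upward steps
step_kernel = np.array([-1.0, -1.0, -1.0, 0.0, 1.0, 1.0, 1.0])

flat = np.ones(20)                                   # constant series
step = np.concatenate([np.zeros(10), np.ones(10)])   # level shift at t = 10

def response(x, w):
    """Sliding dot product of kernel w over series x (no padding, no bias)."""
    return np.correlate(x, w, mode="valid")

flat_max = response(flat, step_kernel).max()
step_max = response(step, step_kernel).max()
print(flat_max, step_max)  # 0.0 3.0
```

The weights sum to zero, so the flat series produces no response, while the kernel peaks where its negative half sits before the shift and its positive half after it.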

In [None]:
from sktime.transformations.panel.rocket import Rocket
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Create pipeline: ROCKET → Ridge Classifier
pipe = make_pipeline(
    Rocket(num_kernels=10_000, random_state=42),
    RidgeClassifier(alpha=1.0)
)

# Fit and predict
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

print("📊 Classification Report (sktime ROCKET):")
print("=" * 50)
print(classification_report(y_test, pred))

## 10. Key Takeaways

| Aspect | Details |
|--------|---------|
| **Core Idea** | Random kernels + simple features (PPV, max) = powerful representations |
| **Kernels** | Weights ∈ {-1, 0, 1} in this notebook (Gaussian in the original paper), lengths ∈ {7, 9, 11}, random dilation & padding |
| **Dilation** | Enables multi-scale pattern detection without more parameters |
| **Features** | 2 per kernel: PPV (frequency of pattern) + GMP (strongest match) |
| **Speed** | Much faster than learned methods (no backprop); transform cost scales linearly with series length |
| **Accuracy** | Competitive with state-of-the-art methods on the UCR archive with just 10K kernels + Ridge |

### When to Use ROCKET

✅ **Use ROCKET when:**
- You need fast training and inference
- Dataset is small to medium (few hundred to thousands of series)
- You want a strong baseline before trying complex models
- Interpretability of individual kernels is not critical

❌ **Consider alternatives when:**
- You need end-to-end differentiable models (use InceptionTime, etc.)
- Series are extremely long (>10K time steps) - consider MiniRocket
- You need kernel interpretability - consider shapelets

---

**Reference**: Dempster, A., Petitjean, F., & Webb, G. I. (2020). ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. *Data Mining and Knowledge Discovery*, 34(5), 1454-1495.