# sktime Data Containers & mtypes

sktime is strict about **data shapes**, because each learning task expects a specific container. This notebook builds intuition for **Series**, **Panel**, and **Hierarchical** data, and shows how to organize time indices and exogenous variables.


## Why containers matter

- **Correctness**: feeding an estimator the wrong shape can raise a validation error or, worse, run without error while misreading instances as time points or variables.
- **Clarity**: a container encodes *what a row means* (time point vs. instance vs. group).
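- **Compatibility**: each supported in-memory format has an *mtype* (machine type) name, such as `pd.Series` or `pd-multiindex`, which sktime uses to recognize and validate pandas/numpy containers.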


## Core scitypes (concepts)

Let $t$ be time, $i$ a series identifier, and $g$ a group.

- **Series**: one sequence $y_t$ (univariate) or $\mathbf{y}_t \in \mathbb{R}^d$ (multivariate).
- **Panel**: multiple sequences $y_{i,t}$ (e.g., many devices).
- **Hierarchical**: grouped panels $y_{g,i,t}$ (e.g., region → store → product).
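
In pandas terms, the three scitypes above differ mainly in how many index levels a container needs. The following is a minimal sketch on made-up data; `describe_container` is a hypothetical helper written for this illustration, not part of sktime.

```python
import numpy as np
import pandas as pd

time = pd.period_range("2020-01", periods=4, freq="M")

# Series: one index level (t); columns hold the d variables of y_t
series = pd.DataFrame({"y": np.arange(4.0)}, index=time)

# Panel: two index levels (i, t)
panel_idx = pd.MultiIndex.from_product([["dev1", "dev2"], time], names=["id", "time"])
panel = pd.DataFrame({"y": np.arange(8.0)}, index=panel_idx)

# Hierarchical: three or more index levels (g, i, t)
hier_idx = pd.MultiIndex.from_product(
    [["north", "south"], ["s1", "s2"], time], names=["region", "store", "time"]
)
hier = pd.DataFrame({"y": np.arange(16.0)}, index=hier_idx)


def describe_container(obj):
    """Hypothetical helper: guess the scitype from the number of index levels."""
    n_levels = obj.index.nlevels
    if n_levels == 1:
        return "Series"
    if n_levels == 2:
        return "Panel"
    return "Hierarchical"


for name, obj in [("series", series), ("panel", panel), ("hier", hier)]:
    print(name, obj.shape, describe_container(obj))
```

Counting index levels is only a heuristic: it says nothing about whether the last level is a valid time index.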


In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

rng = np.random.default_rng(7)
idx = pd.period_range("2018-01", periods=60, freq="M")
trend = 0.2 * np.arange(len(idx))
season = 2 * np.sin(2 * np.pi * np.arange(len(idx)) / 12)
noise = rng.normal(scale=0.8, size=len(idx))

y = pd.Series(trend + season + noise, index=idx, name="y")


## Series container (univariate)

A **Series** is usually a `pd.Series` with a time index. In sktime, a univariate series is the default input for many forecasters.


In [None]:
fig = px.line(y.to_timestamp(), title="Univariate Series (pd.Series)")
fig.update_layout(xaxis_title="Time", yaxis_title="Value")
fig.show()
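
The multivariate case uses a `pd.DataFrame` with one column per variable and the same kind of time index. The sketch below uses made-up data (the names `Y`, `sales`, and `visits` are illustrative) and runs a few plain-pandas index checks that are worth doing before fitting anything.

```python
import numpy as np
import pandas as pd

rng_demo = np.random.default_rng(0)
idx_demo = pd.period_range("2018-01", periods=24, freq="M")

# Multivariate series: one row per time point, one column per variable
Y = pd.DataFrame(
    {
        "sales": rng_demo.normal(loc=100, scale=5, size=len(idx_demo)),
        "visits": rng_demo.normal(loc=500, scale=20, size=len(idx_demo)),
    },
    index=idx_demo,
)

# Basic index checks: a single level, sorted, no duplicates, known frequency
checks = {
    "single_level": Y.index.nlevels == 1,
    "sorted": Y.index.is_monotonic_increasing,
    "unique": Y.index.is_unique,
    "has_freq": Y.index.freq is not None,
}
print(checks)
```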


## Panel container (many series)

A **Panel** can be represented as a `pd.DataFrame` with a two-level **MultiIndex** `(id, time)`, which sktime calls the `pd-multiindex` mtype. This format is common for time-series classification and regression.


In [None]:
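ids = ["A", "B", "C"]
# Build the (id, time) MultiIndex directly so the time level stays a PeriodIndex
panel_index = pd.MultiIndex.from_product([ids, idx], names=["id", "time"])
panel = pd.DataFrame(
    {
        # Each instance is the base series y plus its own noise
        "value": np.tile(y.values, len(ids)) + rng.normal(scale=0.6, size=len(panel_index)),
    },
    index=panel_index,
)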

panel_long = panel.reset_index()
fig = px.line(
    panel_long,
    x=panel_long["time"].dt.to_timestamp(),
    y="value",
    color="id",
    title="Panel Data (MultiIndex: id, time)",
)
fig.update_layout(xaxis_title="Time", yaxis_title="Value")
fig.show()
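
When all instances have the same length, the same panel can also be stored as a 3D NumPy array with axes `(instance, variable, time)`, the layout sktime refers to as `numpy3D`. This sketch does the reshaping by hand with pandas and NumPy on made-up data. It is an illustration of the layout and not sktime's own conversion routine.

```python
import numpy as np
import pandas as pd

time = pd.period_range("2018-01", periods=5, freq="M")
ids = ["A", "B", "C"]
index = pd.MultiIndex.from_product([ids, time], names=["id", "time"])
panel_demo = pd.DataFrame(
    {"value": np.arange(15.0), "temp": np.arange(15.0) * 10},
    index=index,
)

# One (n_variables, n_timepoints) block per instance, stacked along axis 0
blocks = [
    panel_demo.xs(i, level="id").to_numpy().T  # transpose so time runs along columns
    for i in ids
]
X3d = np.stack(blocks)
print(X3d.shape)  # (3, 2, 5)
```

The stacking step fails for unequal-length instances, which is one reason the MultiIndex format is the more general choice.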


## Hierarchical container (groups of panels)

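A **Hierarchical** container adds group levels. Use a MultiIndex with three or more levels, such as `(group, id, time)`, and aggregate along levels as needed. sktime's `pd_multiindex_hier` mtype is a `pd.DataFrame` with such an index; a MultiIndex `pd.Series` can be converted with `.to_frame()`.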


In [None]:
groups = ["North", "South"]
stores = ["S1", "S2"]
index = pd.MultiIndex.from_product([groups, stores, idx], names=["region", "store", "time"])

values = []
for g in groups:
    for s in stores:
        g_effect = 2.0 if g == "North" else -1.0
        s_effect = 0.5 if s == "S1" else -0.3
        values.append(y.values + g_effect + s_effect + rng.normal(scale=0.4, size=len(idx)))

# One-column DataFrame: sktime's hierarchical mtype expects a DataFrame, not a Series
hier = pd.DataFrame({"sales": np.concatenate(values)}, index=index)

# Aggregate to region level for visualization
region = hier.groupby(level=["region", "time"]).sum().reset_index()
fig = px.line(
    region,
    x=region["time"].dt.to_timestamp(),
    y="sales",
    color="region",
    title="Hierarchical Series (region → store → time)",
)
fig.update_layout(xaxis_title="Time", yaxis_title="Sales")
fig.show()
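
Aggregating along levels should be coherent: store-level values sum to region totals, and region totals sum to the grand total. The sketch below checks this with plain pandas on a tiny made-up hierarchy (the name `sales` and the values are illustrative).

```python
import numpy as np
import pandas as pd

time = pd.period_range("2018-01", periods=3, freq="M")
index = pd.MultiIndex.from_product(
    [["North", "South"], ["S1", "S2"], time], names=["region", "store", "time"]
)
sales = pd.DataFrame({"sales": np.arange(1.0, 13.0)}, index=index)

# Structural checks: three index levels, with time as the last one
assert sales.index.nlevels == 3
assert sales.index.names[-1] == "time"

# Sum out the store level, then sum out both grouping levels
by_region = sales.groupby(level=["region", "time"]).sum()
total = sales.groupby(level="time").sum()

# Coherence: region totals add up to the grand total at every time point
region_sum = by_region.groupby(level="time").sum()
coherent = np.allclose(region_sum["sales"].to_numpy(), total["sales"].to_numpy())
print(coherent)
```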


## Exogenous variables (aligned features)

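Many sktime estimators accept **exogenous variables** $X_t$ aligned with $y_t$. Give `X` the same time index as `y`, and avoid leakage: at prediction time, `X` may contain only values that are actually known for the forecast horizon (for example planned promotions, calendar features, or lagged measurements).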


In [None]:
X = pd.DataFrame(
    {
        "promo": (rng.random(len(idx)) > 0.8).astype(int),
        "price": 10 + rng.normal(scale=0.3, size=len(idx)),
    },
    index=idx,
)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y.index.to_timestamp(), y=y, name="y"))
fig.add_trace(go.Scatter(x=X.index.to_timestamp(), y=X["price"], name="price", yaxis="y2"))
fig.update_layout(
    title="Target + Exogenous Feature",
    xaxis_title="Time",
    yaxis=dict(title="y"),
    yaxis2=dict(title="price", overlaying="y", side="right"),
)
fig.show()
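
Alignment and leakage can both be handled with plain pandas. The sketch below assumes that `promo` is planned ahead and known in advance, while `price` is only observed at time $t$ and is therefore replaced by its one-month lag. The data and the split points are made up for illustration.

```python
import numpy as np
import pandas as pd

rng_demo = np.random.default_rng(1)
idx_demo = pd.period_range("2018-01", periods=12, freq="M")
y_demo = pd.Series(rng_demo.normal(size=12), index=idx_demo, name="y")
X_demo = pd.DataFrame(
    {
        "promo": (rng_demo.random(12) > 0.8).astype(int),  # assumed known in advance
        "price": 10 + rng_demo.normal(scale=0.3, size=12),  # assumed observed only at t
    },
    index=idx_demo,
)

# 1) Alignment: X and y must share exactly the same time index
assert X_demo.index.equals(y_demo.index)

# 2) Leakage: replace the contemporaneous price with last month's price
X_safe = X_demo.copy()
X_safe["price_lag1"] = X_safe.pop("price").shift(1)

# 3) Temporal split: no shuffling, test period strictly after the training period
#    (the first row is skipped because its lag is undefined)
y_train, y_test = y_demo.iloc[1:9], y_demo.iloc[9:]
X_train, X_test = X_safe.iloc[1:9], X_safe.iloc[9:]
print(X_train.index.max() < X_test.index.min())
```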


## mtypes (machine types) in sktime

sktime distinguishes **scitypes** (concepts) from **mtypes** (concrete containers). Common mtypes include:

| Scitype | Typical mtypes | Example |
|---|---|---|
| Series | `pd.Series`, `pd.DataFrame`, `np.ndarray` | one sequence, time index |
| Panel | `pd-multiindex` (`pd.DataFrame` with a 2-level MultiIndex), `nested_univ` (nested `pd.DataFrame`), `numpy3D` | multiple sequences |
| Hierarchical | `pd_multiindex_hier` (`pd.DataFrame` with a MultiIndex of 3+ levels) | grouped sequences |

Tip: when debugging, **inspect the index** and ensure time is the *last* level for panel/hierarchical containers.
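
The tip above can be followed with plain pandas, optionally cross-checked against sktime. The sketch below assumes that sktime is installed for the second half and skips that part otherwise. `mtype` and `check_is_mtype` come from `sktime.datatypes`; their exact return values may vary between sktime versions, so the output is printed and not asserted.

```python
import numpy as np
import pandas as pd

time = pd.period_range("2018-01", periods=4, freq="M")
index = pd.MultiIndex.from_product([["A", "B"], time], names=["id", "time"])
panel_demo = pd.DataFrame({"value": np.arange(8.0)}, index=index)

# Plain-pandas inspection: level names and the type of the last level
last_level = panel_demo.index.levels[-1]
print(panel_demo.index.names, type(last_level).__name__)
time_is_last = isinstance(last_level, (pd.PeriodIndex, pd.DatetimeIndex))

try:
    # Optional cross-check; assumes sktime is installed
    from sktime.datatypes import check_is_mtype, mtype

    print(mtype(panel_demo, as_scitype="Panel"))
    print(check_is_mtype(panel_demo, "pd-multiindex", scitype="Panel"))
except ImportError:
    print("sktime not installed; skipping the mtype check")
```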


## Practical checklist

- **Forecasting** → start with a single `pd.Series` (or multivariate `pd.DataFrame`).
- **Classification/Regression** → use a panel container with `(id, time)` index.
- **Hierarchical forecasting** → add grouping levels `(group, id, time)`.
- Keep **time index frequency** consistent and avoid missing timestamps unless the estimator can handle gaps.
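
The last checklist item can be verified by rebuilding the full expected time range and comparing it with the observed index. This is a minimal pandas sketch on a made-up series with two months removed on purpose.

```python
import numpy as np
import pandas as pd

full = pd.period_range("2018-01", periods=8, freq="M")
# Drop two months to simulate missing timestamps
y_gappy = pd.Series(np.arange(8.0), index=full).drop(full[[3, 4]])

# Rebuild the full range from the observed endpoints and compare
expected = pd.period_range(y_gappy.index.min(), y_gappy.index.max(), freq="M")
missing = expected.difference(y_gappy.index)
print(list(missing.astype(str)))

# Make the gaps explicit as NaN so they are visible to downstream steps
y_regular = y_gappy.reindex(expected)
print(int(y_regular.isna().sum()))
```

Whether to impute the resulting `NaN` values or pick an estimator that tolerates them depends on the task.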
