# Efficient Handling of Hierarchical Time Series using Pandas Multi-Indices

- Using Pandas multi-level indexing for convenient manipulation of hierarchical time-series.
- Fast and memory-efficient loading of the raw dataset (uses 1.5–2 GB of RAM during processing).
- Calculation of RMSSE and WRMSSE.
- Seasonal naïve and simple neural net (using PyTorch) benchmarks.

I haven't had much use for Pandas' multi-level indexing in the past, but I figured it might be worth giving it a shot for this competition.  Turns out it's pretty handy for hierarchical time series!

(I saw that someone in the competition has written a package for HTS, which I intend to take a look at some time.  Any
other pointers are also appreciated!)

> I'm assuming some familiarity with the competition and its data.  If you're just starting out, I'd recommend
> [Heads or Tails'](https://www.kaggle.com/headsortails) excellent [EDA kernel](https://www.kaggle.com/headsortails/back-to-predict-the-future-interactive-m5-eda).

The data for the competition consists primarily of 30490 time series of sales data for 3049 items sold in 10 different stores in 3 states.  The items are classified as being in one of 3 categories that are further subdivided into a total of 7 departments.

In this notebook, we'll represent each individual time series as a column in a data frame indexed by the day (`d`).

For the individual (level 12) series, we'll index the columns by `(state_id, store_id, cat_id, dept_id, item_id)`.
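
To make the layout concrete, here is a minimal sketch with a made-up three-series hierarchy (the ids and values are only illustrative, not the competition data).  It shows how `xs` can pick out all series that share a value on one column level:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature hierarchy: two states, three stores, one item each
columns = pd.MultiIndex.from_tuples(
    [("CA", "CA_1", "FOODS", "FOODS_1", "FOODS_1_001"),
     ("CA", "CA_2", "FOODS", "FOODS_1", "FOODS_1_001"),
     ("TX", "TX_1", "HOBBIES", "HOBBIES_1", "HOBBIES_1_001")],
    names=["state_id", "store_id", "cat_id", "dept_id", "item_id"])

# Days as rows, one column per series
toy = pd.DataFrame(np.arange(9.0).reshape(3, 3),
                   index=pd.RangeIndex(1, 4, name="d"),
                   columns=columns)

# Select every series in one state; the selected level is dropped
ca = toy.xs("CA", axis=1, level="state_id")

# Total sales per day for one category
foods_total = toy.xs("FOODS", axis=1, level="cat_id").sum(axis=1)
```

Selecting on any level works the same way, which is what makes this layout convenient for the aggregations later on.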

In [None]:
import numpy as np
import pandas as pd
import csv
from collections import defaultdict

## Create dataset

Using Pandas directly to read the data and reshape it appears to be a bit slow and uses a significant amount of memory.  Instead we'll read the data line by line and store it in NumPy arrays (but we'll try and keep the rest of the code in the notebook nicely vectorized and high-level =).
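
The idea can be sketched on a tiny in-memory CSV.  The column layout below is a simplified, hypothetical stand-in for the real file (fewer id columns, three days), so treat it as an illustration of the approach rather than the actual loader:

```python
import csv
import io

import numpy as np

# Hypothetical three-day extract with two series
raw = io.StringIO(
    "id,item_id,store_id,d_1,d_2,d_3\n"
    "A_S1_validation,A,S1,0,2,1\n"
    "B_S1_validation,B,S1,3,0,0\n")

n_series, n_days = 2, 3
# Preallocate: days as rows, series as columns
values = np.zeros((n_days, n_series), dtype=float)
ids = np.empty(n_series, dtype=object)

reader = csv.reader(raw)
next(reader)  # skip the header row
for i, row in enumerate(reader):
    ids[i] = f"{row[1]}_{row[2]}"
    # Each CSV row becomes one column of the array
    values[:, i] = np.array(row[3:], dtype=float)
```

Since the arrays are allocated once up front, only a single CSV row is held in memory in addition to the final arrays.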

In [None]:
SALES = "../input/m5-forecasting-accuracy/sales_train_validation.csv"
PRICES = "../input/m5-forecasting-accuracy/sell_prices.csv"
CALENDAR = "../input/m5-forecasting-accuracy/calendar.csv"

# SALES = "../data/raw/sales_train_validation.csv"
# PRICES = "../data/raw/sell_prices.csv"
# CALENDAR = "../data/raw/calendar.csv"

NUM_SERIES = 30490
NUM_TRAINING = 1913
NUM_TEST = NUM_TRAINING + 2 * 28

In [None]:
series_ids = np.empty(NUM_SERIES, dtype=object)
item_ids = np.empty(NUM_SERIES, dtype=object)
dept_ids = np.empty(NUM_SERIES, dtype=object)
cat_ids = np.empty(NUM_SERIES, dtype=object)
store_ids = np.empty(NUM_SERIES, dtype=object)
state_ids = np.empty(NUM_SERIES, dtype=object)

In [None]:
qties = np.zeros((NUM_TRAINING, NUM_SERIES), dtype=float)
sell_prices = np.zeros((NUM_TEST, NUM_SERIES), dtype=float)

### Importing and reshaping sales data

Each row in the sales data starts with six columns: an id for the series followed by the five levels item, department, category, store, and state.  The remaining columns hold the daily sales quantities.

In [None]:
%%time
id_idx = {}
with open(SALES, "r", newline='') as f:
    is_header = True
    i = 0
    for row in csv.reader(f):
        if is_header:
            is_header = False
            continue
        series_id, item_id, dept_id, cat_id, store_id, state_id = row[0:6]
        # Remove '_validation/_evaluation' at end by regenerating series_id
        series_id = f"{item_id}_{store_id}"

        qty = np.array(row[6:], dtype=float)

        series_ids[i] = series_id

        item_ids[i] = item_id
        dept_ids[i] = dept_id
        cat_ids[i] = cat_id
        store_ids[i] = store_id
        state_ids[i] = state_id

        qties[:, i] = qty

        id_idx[series_id] = i

        i += 1

### Importing calendar data

The calendar data has information about which day of the week a given day is, if there are any special events, and most importantly for this notebook, which week (`wm_yr_wk`) the day is in.  We'll need this to get the prices of items, which in turn is necessary in order to calculate the weights we need for estimating our scores.
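
As a rough sketch of why the week-to-day mapping matters, here is how weekly prices can be spread over days for a single series.  The calendar rows and prices are made up for the example:

```python
from collections import defaultdict

import numpy as np

# Hypothetical calendar: (day label, week id) pairs
calendar_rows = [("d_1", "11101"), ("d_2", "11101"), ("d_3", "11102")]

week_to_days = defaultdict(list)
for d, week in calendar_rows:
    week_to_days[week].append(int(d[2:]))  # "d_3" -> 3

# Hypothetical weekly prices for a single series
weekly_prices = {"11101": 2.5, "11102": 3.0}

daily_prices = np.zeros(3)
for week, price in weekly_prices.items():
    for day in week_to_days[week]:
        daily_prices[day - 1] = price  # days are 1-based
```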

In [None]:
%%time
wm_yr_wk_idx = defaultdict(list)  # map each wm_yr_wk to its days (d)
with open(CALENDAR, "r", newline='') as f:
    for row in csv.DictReader(f):
        d = int(row['d'][2:])
        wm_yr_wk_idx[row['wm_yr_wk']].append(d)
        # TODO: Import the rest of the data

### Importing price data

The price data describes the weekly prices for each item in every store.

In [None]:
%%time
with open(PRICES, "r", newline='') as f:
    is_header = True
    for row in csv.reader(f):
        if is_header:
            is_header = False
            continue
        store_id, item_id, wm_yr_wk, sell_price = row
        series_id = f"{item_id}_{store_id}"
        series_idx = id_idx[series_id]
        for d in wm_yr_wk_idx[wm_yr_wk]:
            sell_prices[d - 1, series_idx] = float(sell_price)

### Building DataFrame

We'll store the dataset in two dataframes:

- **`qty_ts`:** sales data.
- **`price_ts`:** prices.

In [None]:
qty_ts = pd.DataFrame(qties,
                      index=range(1, NUM_TRAINING + 1),
                      columns=[state_ids, store_ids,
                               cat_ids, dept_ids, item_ids])

qty_ts.index.names = ['d']
qty_ts.columns.names = ['state_id', 'store_id',
                        'cat_id', 'dept_id', 'item_id']

price_ts = pd.DataFrame(sell_prices,
                        index=range(1, NUM_TEST + 1),
                        columns=[state_ids, store_ids,
                                 cat_ids, dept_ids, item_ids])
price_ts.index.names = ['d']
price_ts.columns.names = ['state_id', 'store_id',
                          'cat_id', 'dept_id', 'item_id']

And if we look at the data, we see how the series are organized into columns:

In [None]:
qty_ts

In [None]:
price_ts

## Aggregation

In this competition, our models are evaluated on 12 different levels defined by combinations of the groupings of the series.  

It is important that we can aggregate our time series, e.g., calculate the total sales in each state, so that
we can evaluate a model's per-store item sales forecasts on every level.

The levels used in the competition are:

In [None]:
LEVELS = {
    1: [],
    2: ['state_id'],
    3: ['store_id'],
    4: ['cat_id'],
    5: ['dept_id'],
    6: ['state_id', 'cat_id'],
    7: ['state_id', 'dept_id'],
    8: ['store_id', 'cat_id'],
    9: ['store_id', 'dept_id'],
    10: ['item_id'],
    11: ['state_id', 'item_id'],
    12: ['item_id', 'store_id']
}

Pandas views all column levels as independent, but here they are not; all series with the same `dept_id` belong to the same `cat_id`, for example.  When grouping our columns, we'll also keep any coarser groupings.
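
Here is a small sketch of what keeping the coarser grouping buys us, using a made-up two-level hierarchy.  I transpose before grouping here, which I'm assuming is equivalent to grouping the columns directly:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("FOODS", "FOODS_1"), ("FOODS", "FOODS_1"), ("FOODS", "FOODS_2"),
     ("HOBBIES", "HOBBIES_1")],
    names=["cat_id", "dept_id"])
toy = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns)

# Grouping on dept_id alone drops the cat_id level ...
by_dept = toy.T.groupby(level="dept_id").sum().T
# ... whereas grouping on (cat_id, dept_id) yields the same sums and keeps it
by_cat_dept = toy.T.groupby(level=["cat_id", "dept_id"]).sum().T
```

Because every department belongs to exactly one category, adding `cat_id` to the grouping doesn't split any group; it only preserves the label.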

In [None]:
COARSER = {
    'state_id': [],
    'store_id': ['state_id'],
    'cat_id': [],
    'dept_id': ['cat_id'],
    'item_id': ['cat_id', 'dept_id']
}

In [None]:
def aggregate_all_levels(df):
    levels = []
    for i in range(1, max(LEVELS.keys()) + 1):
        level = aggregate_groupings(df, i, *LEVELS[i])
        levels.append(level)
    return pd.concat(levels, axis=1)

def aggregate_groupings(df, level_id, grouping_a=None, grouping_b=None):
    """Aggregate time series by summing over optional levels

    New columns are named according to the m5 competition.

    :param df: Time series as columns
    :param level_id: Numeric ID of level
    :param grouping_a: Grouping to aggregate over, if any
    :param grouping_b: Additional grouping to aggregate over, if any
    :return: Aggregated DataFrame with columns as series id:s
    """
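    if grouping_a is None and grouping_b is None:
        new_df = df.sum(axis=1).to_frame()
    else:
        # Group on the requested levels plus any coarser levels they imply
        if grouping_b is None:
            keys = COARSER[grouping_a] + [grouping_a]
        else:
            assert grouping_a is not None
            keys = (COARSER[grouping_a] + COARSER[grouping_b] +
                    [grouping_a, grouping_b])
        # Transpose and group the rows, since `groupby(..., axis=1)` is
        # deprecated in recent pandas versions
        new_df = df.T.groupby(level=keys).sum().T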

    new_df.columns = _restore_columns(df.columns, new_df.columns, level_id,
                                      grouping_a, grouping_b)
    return new_df

A small complication is that Pandas doesn't align column levels during column-wise concatenation, i.e., if two dataframes have different column levels, `pd.concat` does not match up the levels they share.

The easiest solution for now is to add back the levels we lost after grouping.

In [None]:
def _restore_columns(original_index, new_index, level_id, grouping_a, grouping_b):
    original_df = original_index.to_frame()
    new_df = new_index.to_frame()
    for column in original_df.columns:
        if column not in new_df.columns:
            new_df[column] = None

    # Set up `level` column
    new_df['level'] = level_id

    # Set up `id` column
    if grouping_a is None and grouping_b is None:
        new_df['id'] = 'Total_X'
    elif grouping_b is None:
        new_df['id'] = new_df[grouping_a] + '_X'
    else:
        assert grouping_a is not None
        new_df['id'] = new_df[grouping_a] + '_' + new_df[grouping_b]

    new_index = pd.MultiIndex.from_frame(new_df)
    # Remove "unnamed" level if no grouping
    if grouping_a is None and grouping_b is None:
        new_index = new_index.droplevel(0)
    new_levels = ['level'] + original_index.names + ['id']
    return new_index.reorder_levels(new_levels)

A quick peek at the aggregated sales data:

In [None]:
aggregate_all_levels(qty_ts)

## Evaluation

### Weights

The scoring takes into account the final month's total sales and weights the series on every level accordingly.
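
A toy version of the calculation, with made-up dollar sales for two levels, might look like this.  I use `transform` to make the per-level broadcast explicit; the series names are only illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, "Total_X"), (2, "CA_X"), (2, "TX_X")], names=["level", "id"])
dollar_sales = pd.Series([100.0, 75.0, 25.0], index=index)

# Divide each series' sales by the total for its level
level_totals = dollar_sales.groupby(level="level").transform("sum")
toy_weights = dollar_sales / level_totals
```

The weights sum to one within each level, so every level contributes equally when the per-level scores are averaged.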

In [None]:
def calculate_weights(totals):
    """Calculate weights from total sales.

    Uses all data in the dataframe, so remember to pass total sales
    (quantity times sell price) for the relevant period only.

    :param totals: Total sales
    :return: Series of weights with (level, *_id, id:) as multi-index
    """
    summed = aggregate_all_levels(totals).sum()
    
    return summed / summed.groupby(level='level').sum()

> **NB.** I'm writing this notebook when the public leaderboard is based on the actual final month (strictly speaking, the final 28-day period) of the training data; therefore, the weights are actually calculated using the month before that.  A bit confusing, I know.

In [None]:
# `.loc` slices include both endpoints, so this selects exactly the last 28 days
final_month_totals = (qty_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING] *
                      price_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING])

weights = calculate_weights(final_month_totals)

(We can compare the weights with [the validation weights in the M5 repo](https://raw.githubusercontent.com/Mcompetitions/M5-methods/master/validation/weights_validation.csv) to check that the weights have been calculated correctly.)

### Scales

For the Root Mean Squared Scaled Error metric used in the competition, we need to compute scales using time series data up to the forecast.

> **NB.** We have to be careful not to use scale values using data from after the forecasting has begun since that would leak information from the future.  For the same reason, we can't use the weights we calculated above during training since they are based on the last period.

For each day, we calculate the scales of all series up to that day.  Scales are essentially defined by the mean squared difference between consecutive days (the Scaled Pinball Loss used in the companion to this competition uses absolute differences instead of squared ones).

In [None]:
def cumulative_scales(history, f):
    """Calculate column-wise cumulative scales.

    :param history: Values (in day-order)
    :param f: Function to apply to differences, e.g., square for RMSSE, abs for SPL"""
    # True from the day after the first non-zero value onwards
    active = (history.cumsum() > 0).shift(1, fill_value=False)
    # Number of differences available up to and including each day
    ns = active.cumsum()
    # Only count differences between days where the series was already active
    diffs = f(history - history.shift(1)).where(active, 0)
    scales = diffs.cumsum() / ns

    # Fill parts where no sales with ∞ (effectively ignore series there)
    return scales.fillna(np.inf)


def cumulative_squared_scales(history):
    """Calculate column-wise cumulative scales for RMSSE (squared)."""
    return cumulative_scales(history, np.square)

In [None]:
def calculate_scales(history):
    """Calculate scales using all of history."""
    return cumulative_squared_scales(history).iloc[-1]

### RMSSE and WRMSSE

The metric in this competition essentially compares the model's performance to that of a naive model that always predicts that the next day will be the same as the current day:

$$
\mathrm{RMSSE} = \sqrt{
\frac{1}{h} \frac{\sum_{t = n + 1}^{n + h}(Y_t - \hat{Y}_t)^2}{\frac{1}{n - 1}\sum_{t = 2}^{n}(Y_t - Y_{t - 1})^2}
}.
$$

$Y_t$ is the actual value at $t$, $\hat{Y}_t$ the forecasted value, $n$ the number of historical observations, and $h$ the forecasting horizon.
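
To check the arithmetic, here is the formula worked through in plain NumPy for a single made-up series.  The helper `rmsse` below is just for this example, and it assumes the history already starts at its first non-zero value:

```python
import numpy as np

def rmsse(history, actual, forecast):
    """RMSSE for one series whose history starts at its first non-zero value."""
    history, actual, forecast = (np.asarray(a, dtype=float)
                                 for a in (history, actual, forecast))
    scale = np.mean(np.diff(history) ** 2)   # in-sample one-step naive MSE
    mse = np.mean((actual - forecast) ** 2)  # out-of-sample forecast MSE
    return np.sqrt(mse / scale)

history = [2, 4, 1, 1]      # squared differences: 4, 9, 0 -> scale 13/3
actual = [1, 3]
naive_forecast = [1, 1]     # repeat the last observed value
score = rmsse(history, actual, naive_forecast)  # sqrt(2 / (13/3))
```

A score below 1 means the forecast's errors are smaller than the typical day-to-day change in the history.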

In [None]:
def evaluate_rmsse(actual_full, forecast_full, history_full):
    scale = calculate_scales(history_full)

    rmsse = ((actual_full - forecast_full).pow(2).mean() / scale) \
        .pow(1 / 2)
    return rmsse

def evaluate_all_rmsse(actual, forecast, history):
    """Evaluate per-series RMSSE after aggregation"""
    actual_full = aggregate_all_levels(actual)
    forecast_full = aggregate_all_levels(forecast)
    history_full = aggregate_all_levels(history)

    return evaluate_rmsse(actual_full, forecast_full, history_full)

def evaluate_rmsse_wrmsse_per_level(actual, forecast, history, weights):
    """Aggregate series and return per-level RMSSE and WRMSSE"""
    rmsse = evaluate_all_rmsse(actual, forecast, history)
    # Average per-series RMSSE over levels; `groupby(level=...)` replaces the
    # `level` argument of `mean`/`sum`, which was removed in pandas 2.0
    return (rmsse.groupby(level='level').mean(),
            (weights * rmsse).groupby(level='level').sum())

For fun, we can take the sales data for the final month, add some noise, and see how large an RMSSE and WRMSSE that gives us.  (This could be useful for getting an idea of how good our predictions are.)

In [None]:
final_month = qty_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING]
final_month_noise = np.clip(final_month + np.random.normal(loc=0.0, scale=0.5, size=final_month.shape), 0, None)

In [None]:
# History must end the day before the final month (`.loc` slices are inclusive)
noise_rmsse, noise_wrmsse = evaluate_rmsse_wrmsse_per_level(
    final_month, final_month_noise, qty_ts.loc[:NUM_TRAINING - 28], weights)

In [None]:
noise_rmsse

In [None]:
noise_rmsse.mean()

In [None]:
noise_wrmsse

In [None]:
noise_wrmsse.mean()

## Benchmarks

We'll evaluate the model using the final month of the training data as our validation set.  For better CV, 
we should really split the data into more pieces, but then we have to take into account the way different
items are introduced in different stores at different times.  For the final month, we know that all items
have been available for at least a couple of months.

In [None]:
# `.loc` slices include both endpoints: train on days 1..1885, test on 1886..1913
qty_train = qty_ts.loc[:NUM_TRAINING - 28]
qty_test = qty_ts.loc[NUM_TRAINING - 28 + 1:NUM_TRAINING]

def evaluate_model(model):
    model.fit(None, qty_train)
    qty_pred = model.forecast(None, 28)
    _, wrmsses = evaluate_rmsse_wrmsse_per_level(qty_test, qty_pred, qty_train, weights)
    return wrmsses.mean()

### Seasonal Naïve

Let's start with a simple model which repeats the last `period` observations. 
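
Stripped of the data frame bookkeeping, the idea fits in a few lines of NumPy.  This is only a sketch for a single series; `seasonal_naive` is a helper made up for the example:

```python
import numpy as np

def seasonal_naive(history, period, h):
    """Repeat the last `period` observations cyclically for `h` steps."""
    last_cycle = np.asarray(history)[-period:]
    return last_cycle[np.arange(h) % period]

forecast = seasonal_naive([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], period=3, h=5)
```

With `period=3`, the last cycle is `[8, 9, 10]`, and the forecast wraps around once the cycle is exhausted.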

In [None]:
class SeasonalNaive(object):
    def __init__(self, period):
        self.period = period

    def fit(self, features, target):
        self.history = target.iloc[-self.period:]
        self.d = self.history.index[-1]

        return self

    def forecast(self, features, h):
        """Forecast the next h days"""
        fs = []
        for i in range(h):
            self.d += 1
            assert self.history.index[0] + self.period == self.d
            # Copy so that relabelling the row doesn't touch the history
            f = self.history.iloc[0:1].copy()
            f.index = [self.d]
            fs.append(f)
            # `DataFrame.append` was removed in pandas 2.0
            self.history = pd.concat([self.history.iloc[1:], f])
        return pd.concat(fs)

There is some seasonality in the data, for example, weekly and 4-weekly.  Let's try and take advantage of it.

In [None]:
evaluate_model(SeasonalNaive(7))

In [None]:
evaluate_model(SeasonalNaive(28))

### Neural Net

As a hopefully more interesting example of how we might use the data organized in this form, let's try a simple 2-layer neural network using PyTorch.

We'll implement a custom RMSSE loss and try to make sure we move all of our data into the GPU before training.

Let's assume that the relationship between past and future values is the same for all series, so that we can just bundle up all of the series and train the model on all of them at once.  The model will forecast all 28 days at once.

> Since this is just meant as a simple example, not much thought has been put into the model, and no effort has been
> made to make results reproducible.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

In [None]:
ACTIVATION = {
    'relu': F.relu,
    # `F.tanh` and `F.sigmoid` are deprecated in favour of the torch versions
    'tanh': torch.tanh,
    'sigmoid': torch.sigmoid,
    'linear': lambda x: x
}

ACTIVATION_FUNCTIONS = list(ACTIVATION.keys())

In [None]:
def hidden_init(layer):
    # `nn.Linear` stores its weights as (out_features, in_features)
    fan_in = layer.weight.data.size()[1]
    lim = 1. / np.sqrt(fan_in)
    return (-lim, lim)


class Network(nn.Module):
    def __init__(self, lookback, layer_1_size, layer_1_activation, layer_2_size,
                 layer_2_activation):
        """Initialize parameters and build model."""
        super().__init__()
        self.fc1 = nn.Linear(lookback, layer_1_size)
        self.d1 = nn.Dropout()
        self.f1 = ACTIVATION[layer_1_activation]

        self.fc2 = nn.Linear(layer_1_size, layer_2_size)
        self.f2 = ACTIVATION[layer_2_activation]
        self.d2 = nn.Dropout()

        self.fc3 = nn.Linear(layer_2_size, FORECAST_DAYS)

        self.initialize_weights()

    def initialize_weights(self):
        """Initializes the weights with random values"""
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, qties):
        x = self.fc1(qties)
        x = self.d1(x)
        x = self.f1(x)

        x = self.fc2(x)
        x = self.d2(x)
        x = self.f2(x)

        x = self.fc3(x)

        return x

The model has a couple of parameters:

- the network architecture is described by the sizes and activation functions of the layers,
- the number of previous steps the model uses in order to forecast the future is given by `lookback`, and
- the model trains on all series for a number of `batches`, each offset from the previous one by one day.

(Handling scales and making sure all tensors have the correct size is kind of tricky.)
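
The windowing can be sketched for a single series as follows.  This is a simplified illustration, not the exact slicing used in the class below: `training_windows` is a helper made up for the example, and it yields exactly `batches` windows, with the last one ending at the end of the series:

```python
import numpy as np

def training_windows(series, lookback, horizon, batches):
    """Yield (inputs, targets) pairs, each shifted one day later than the last."""
    series = np.asarray(series)
    # Start early enough for the final window to end on the last observation
    first = len(series) - (lookback + horizon + batches - 1)
    for i in range(batches):
        start = first + i
        yield (series[start:start + lookback],
               series[start + lookback:start + lookback + horizon])

windows = list(training_windows(np.arange(10), lookback=3, horizon=2, batches=4))
```

Each pair holds `lookback` inputs followed immediately by the `horizon` values the network should predict from them.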

In [None]:
def rmsse_loss(input, target, scales):
    return (((input - target)**2 / scales).sum() / input.data.nelement()).sqrt()

FORECAST_DAYS = 28

class NeuralNet(object):
    def __init__(self, lookback,
                 layer_1_size, layer_1_activation,
                 layer_2_size, layer_2_activation,
                 batches, shuffle,
                 epochs,
                 device):
        self.device = device

        self.lookback = lookback
        self.layer_1_size = layer_1_size
        self.layer_1_activation = layer_1_activation
        self.layer_2_size = layer_2_size
        self.layer_2_activation = layer_2_activation
        self.batches = batches
        self.shuffle = shuffle

        self.epochs = epochs

    def fit(self, features, target):
        """Attempts to predict the last 28 days"""
        y = (target.iloc[-(FORECAST_DAYS + self.batches):].values
             .transpose())
        X = (target.iloc[-(FORECAST_DAYS + self.lookback
                           + self.batches):-FORECAST_DAYS]
             .values.transpose())

        y = torch.from_numpy(y).float().to(self.device)
        X = torch.from_numpy(X).float().to(self.device)

        # Calculate scales (remember to avoid leaks from the future!)
        scales = cumulative_squared_scales(target) \
                     .values[
                 -(FORECAST_DAYS + self.batches):-(FORECAST_DAYS - 1)]
        scales = scales.transpose()
        scales = torch.from_numpy(scales).float().to(self.device)


        net = Network(self.lookback,
                      self.layer_1_size,
                      self.layer_1_activation,
                      self.layer_2_size,
                      self.layer_2_activation).to(self.device)
        self.net = net

        optimizer = optim.Adam(net.parameters())

        for epoch in tqdm(range(self.epochs)):
            running_loss = 0.0

            batch_idxs = np.arange(self.batches + 1)
            if self.shuffle:
                np.random.shuffle(batch_idxs)
            for i in batch_idxs:
                optimizer.zero_grad()

                X_run = X[:, i:(i + self.lookback)]
                y_run = y[:, i:(i + FORECAST_DAYS)]
                scales_run = scales[:, i:(i + 1)]


                forecast = net(X_run)

                loss = rmsse_loss(forecast, y_run, scales_run)
                loss.backward()
                optimizer.step()

                running_loss += loss.item()

            mean_loss = running_loss / len(batch_idxs)
            # print(f"Epoch {epoch + 1}: Loss {mean_loss:.2f}")

        # Store history
        self.history = target.iloc[-self.lookback:]
        self.d = self.history.index[-1]

        return self

    def forecast(self, features, h):
        # For now, only handle full period
        assert h == FORECAST_DAYS

        # Switch off dropout while forecasting
        self.net.eval()
        with torch.no_grad():
            X = self.history.values.transpose()
            X = torch.from_numpy(X).float().to(self.device)
            forecast = self.net(X).cpu().numpy()

        forecast = forecast.transpose()
        first_day = self.d + 1
        forecast = pd.DataFrame(forecast,
                                index=range(first_day, first_day + h),
                                columns=self.history.columns)
        # Remove any negative values
        forecast = forecast.clip(lower=0)

        # Roll the history forward so that it ends on the last forecast day
        self.history = pd.concat([self.history, forecast]).iloc[-self.lookback:]
        self.d = self.history.index[-1]
        return forecast

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Let's try some random values for the parameters and see what kind of performance we get:

In [None]:
nnet = NeuralNet(lookback=140, 
                 layer_1_size=512, layer_1_activation='relu',
                 layer_2_size=256, layer_2_activation='relu',
                 batches=7, shuffle=True, epochs=64, 
                 device=device)

In [None]:
evaluate_model(nnet)

### Ensemble

Another fun experiment is to create a simple ensemble that takes the mean of the forecasts of several models.

In [None]:
from functools import reduce
import operator

class Ensemble(object):
    def __init__(self, models):
        self.models = models
    
    def fit(self, features, target):
        for model in self.models:
            model.fit(features, target)
    
    def forecast(self, features, h):
        return reduce(operator.add, 
                      [model.forecast(features, h) for model in self.models]) / len(self.models)

It seems likely that an ensemble of seasonal naïve predictors would perform better than a single one, and it turns out to be quite an improvement:

In [None]:
naive_ensemble = Ensemble([SeasonalNaive(7), SeasonalNaive(28)])

In [None]:
evaluate_model(naive_ensemble)

We can also include our neural net in the mix.  Note that by reusing the previous ensemble, we're essentially giving the seasonal naïve predictors individual weights of 0.25 each, and the neural net 0.5.  We could try to optimize the weights, but we probably want a better validation scheme for that.
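
The weights follow directly from averaging an average.  A quick check with made-up two-day forecasts (the numbers are arbitrary):

```python
import numpy as np

naive_7 = np.array([10.0, 12.0])
naive_28 = np.array([14.0, 8.0])
neural = np.array([6.0, 20.0])

# Averaging an average: the inner ensemble counts as a single member
nested = ((naive_7 + naive_28) / 2 + neural) / 2
# The same forecast written with explicit per-model weights
weighted = 0.25 * naive_7 + 0.25 * naive_28 + 0.5 * neural
```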

In [None]:
large_ensemble = Ensemble([naive_ensemble, nnet])

In [None]:
evaluate_model(large_ensemble)

I have run this ensemble a number of times, and sometimes it performs a lot better than the naïve ensemble, and other times it performs worse.  Even though our validation isn't really good enough, it seems as if we could squeeze out some extra performance using a larger ensemble. 

Just for fun, let's try a much larger one:

In [None]:
huge_ensemble = Ensemble([
    SeasonalNaive(7), 
    SeasonalNaive(14), 
    SeasonalNaive(21), 
    SeasonalNaive(28),
    SeasonalNaive(56),
    NeuralNet(lookback=140,
              layer_1_size=512, layer_1_activation='relu',
              layer_2_size=256, layer_2_activation='relu',
              batches=7, shuffle=True, epochs=64, 
              device=device),
    NeuralNet(lookback=365, 
              layer_1_size=1024, layer_1_activation='relu',
              layer_2_size=512, layer_2_activation='relu',
              batches=140, shuffle=True, epochs=64, 
              device=device)])

In [None]:
evaluate_model(huge_ensemble)

## Submission

Let's use the huge ensemble above and create a submission using it.

In [None]:
%%time
huge_ensemble.fit(None, qty_ts)
qty_pred = huge_ensemble.forecast(None, 28)

In [None]:
def convert_to_submission(forecast):
    """Convert level 12-predictions to submission"""
    # Use the argument rather than the global `qty_pred`
    df = aggregate_all_levels(forecast)\
        .transpose()\
        .reset_index(level=['level', 'state_id', 'store_id', 'cat_id', 'dept_id', 'item_id'],
                     drop=True)
    df.columns = [f"F{i}" for i in range(1, 29)]
    validation = df
    evaluation = df.copy()
    
    validation.index += "_validation"
    evaluation.index += "_evaluation"
    
    return pd.concat([validation, evaluation])

In [None]:
submission = convert_to_submission(qty_pred)

In [None]:
# You can't submit zip-files directly from notebooks, otherwise one could use this instead:
# submission.to_csv("submission.zip")
submission.to_csv("submission.csv")