This notebook shows my solution to the M5 Forecasting - Accuracy challenge.  

The basic ideas behind my solution were the following: 

* I first estimated what a good score would be and found that a discrepancy of around 10% per item results in a score of around 0.5. Since the task was to predict each item per day, I assumed that a difference of 10% is acceptable. Moreover, there is a random factor: many items are not sold on many days, so whether a given item is sold on a given day is partly random. For an item that is sold 10 times per day, a 10% discrepancy means that it is sold between 9 and 11 times. In other words, it is in the nature of the data that a perfect prediction is impossible. I assumed that 10% is acceptable, and the final best scores suggest that this assumption was reasonable. 
* I assumed that predicting the number of sold items per store and department is easier and more stable than predicting each item. Furthermore, the weights of the final evaluation show that getting the predictions right at these higher aggregation levels has the highest influence: if the predictions are right there, about 50% of the weighted score is correct. 
* Focusing on the higher-level data also has the advantage that it requires less memory. 
* For predicting the higher-level data, I used a LightGBM model. I did not optimize the hyperparameters much, since the results were acceptable from the beginning. LightGBM has the disadvantage that, as a tree-based model, it cannot extrapolate, so I could not forecast a trend. However, I visually inspected the (high-level) data, and there may be a very slight trend, but not much of one. Furthermore, the forecast covers only 28 days, so I assumed that a trend would not have a huge impact and ignored it. 
* For the prediction on the item level, I took the predicted number of sold items per store and department and distributed it over all items: I calculated each item's share of the store and department total over the last 28 days and multiplied this share by the predicted number of sold items. 

Finally, I think there are a few flaws in the task and the evaluation method. From a customer's perspective, I assume it is important to know how many items should be in stock. For items that are sold many times per day, a daily prediction may be the right granularity. However, for items that are sold only once per week or even less often, it does not make much sense to predict them on a daily basis. 

In [None]:
from datetime import datetime, timedelta
from typing import Union
import functools

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt

# Evaluation class
Adapted from https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133834

Thanks to [sakami](https://www.kaggle.com/sakami)

In [None]:
class WRMSSEEvaluator(object):
    # Adapted from https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133834
    # Thanks to sakami
    
    group_ids = ( 'all_id', 'state_id', 'store_id', 'cat_id', 'dept_id', 'item_id',
        ['state_id', 'cat_id'],  ['state_id', 'dept_id'], ['store_id', 'cat_id'],
        ['store_id', 'dept_id'], ['item_id', 'state_id'], ['item_id', 'store_id'])

    def __init__(self, 
                 train_df: pd.DataFrame, 
                 valid_df: pd.DataFrame, 
                 calendar: pd.DataFrame, 
                 prices: pd.DataFrame):
        '''
        Initialize the evaluator and precompute the aggregated series and scales
        '''
        self.disp = False
        self.calendar = calendar
        self.prices = prices
        self.train_df = train_df
        self.valid_df = valid_df
        self.train_target_columns = [i for i in self.train_df.columns if i.startswith('d_')]

        self.train_df['all_id'] = "all"

        self.id_columns = [i for i in self.train_df.columns if not i.startswith('d_')]
        self.valid_target_columns = [i for i in self.valid_df.columns if i.startswith('d_')]

        if not all([c in self.valid_df.columns for c in self.id_columns]):
            self.valid_df = pd.concat([self.train_df[self.id_columns], self.valid_df],
                                      axis=1, 
                                      sort=False)
        self.train_series = self.trans_30490_to_42840(self.train_df, 
                                                      self.train_target_columns, 
                                                      self.group_ids)
        self.valid_series = self.trans_30490_to_42840(self.valid_df, 
                                                      self.valid_target_columns, 
                                                      self.group_ids)
        self.scale = self.get_scale()
        
    def get_train_set_preyear(self): 
        # Validation on 1577, 1605 (starts at 25.05.2015)
        # Training on 1 to 1576
        return self.train_df    
    
    def get_train_set_premonth(self): 
        # Validation on 1914, 1941 (starts on 25.04.2016)
        return self.train_df
    
    def get_train_set_final(self): 
        # No validation, final submission (starts on 23.05.2016)
        # Submission from 1941 to 1969
        return self.train_df

    def get_scale(self):
        '''
        scaling factor for each series ignoring starting zeros
        '''
        scales = []
        for i in range(len(self.train_series)):
            series = self.train_series.iloc[i].values
            series = series[np.argmax(series!=0):]
            scale = ((series[1:] - series[:-1]) ** 2).mean()
            scales.append(scale)
        return np.array(scales)
    
    def get_name(self, i):
        '''
        convert a str or list of strings to unique string 
        used for naming each of 42840 series
        '''
        if type(i) == str or type(i) == int:
            return str(i)
        else:
            return "--".join(i)
    
    @functools.lru_cache(maxsize=3)
    def get_weight_df(self, weight_columns_start) -> pd.DataFrame:
        """
        returns weights for each of 42840 series in a dataFrame
        """
        weight_columns = ['d_%d' % (x) for x in range(weight_columns_start, weight_columns_start+28)]
        day_to_week = self.calendar.set_index("d")["wm_yr_wk"].to_dict()
        weight_df = self.train_df[["item_id", "store_id"] + weight_columns].set_index(
            ["item_id", "store_id"]
        )
        weight_df = (
            weight_df.stack().reset_index().rename(columns={"level_2": "d", 0: "value"})
        )
        weight_df["wm_yr_wk"] = weight_df["d"].map(day_to_week)
        weight_df = weight_df.merge(
            self.prices, how="left", on=["item_id", "store_id", "wm_yr_wk"]
        )
        weight_df["value"] = weight_df["value"] * weight_df["sell_price"]
        weight_df = weight_df.set_index(["item_id", "store_id", "d"]).unstack(level=2)[
            "value"
        ]
        weight_df = weight_df.loc[
            zip(self.train_df.item_id, self.train_df.store_id), :
        ].reset_index(drop=True)
        weight_df = pd.concat(
            [self.train_df[self.id_columns], weight_df], axis=1, sort=False
        )
        weights_map = {}
        for i, group_id in enumerate(self.group_ids):
            lv_weight = weight_df.groupby(group_id)[weight_columns].sum().sum(axis=1)
            lv_weight = lv_weight / lv_weight.sum()
            for i in range(len(lv_weight)):
                weights_map[self.get_name(lv_weight.index[i])] = np.array(
                    [lv_weight.iloc[i]]
                )
        weights = pd.DataFrame(weights_map).T / len(self.group_ids)

        return weights

    def trans_30490_to_42840(self, df, cols, group_ids, dis=False):
        '''
        transform 30490 series to all 42840 series
        '''
        series_map = {}
        for i, group_id in enumerate(self.group_ids):
            tr = df.groupby(group_id)[cols].sum()
            for i in range(len(tr)):
                series_map[self.get_name(tr.index[i])] = tr.iloc[i].values
        return pd.DataFrame(series_map).T
    
    def get_rmsse(self, valid_preds) -> pd.Series:
        '''
        returns rmsse scores for all 42840 series
        '''
        score = ((self.valid_series - valid_preds) ** 2).mean(axis=1)
        rmsse = (score / self.scale).map(np.sqrt)
        return rmsse

    def score(self, valid_preds: Union[pd.DataFrame, np.ndarray], weight_columns) -> float:
        assert self.valid_df[self.valid_target_columns].shape == valid_preds.shape, f"{self.valid_df[self.valid_target_columns].shape} vs {valid_preds.shape}"

        if isinstance(valid_preds, np.ndarray):
            valid_preds = pd.DataFrame(valid_preds, columns=self.valid_target_columns)

        valid_preds = pd.concat([self.valid_df[self.id_columns].set_index(self.valid_df.id), valid_preds], #.set_index(self.valid_df.id)
                                axis=1, 
                                sort=False)
        valid_preds = self.trans_30490_to_42840(valid_preds, 
                                                self.valid_target_columns, 
                                                self.group_ids, 
                                                False)
        self.rmsse = self.get_rmsse(valid_preds)
        self.contributors = pd.concat([self.get_weight_df(weight_columns), self.rmsse], 
                                      axis=1, 
                                      sort=False).prod(axis=1)
        return np.sum(self.contributors)

# Calculating baseline score

In [None]:
train_df = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_evaluation.csv')
calendar = pd.read_csv('../input/m5-forecasting-accuracy/calendar.csv')
prices = pd.read_csv('../input/m5-forecasting-accuracy/sell_prices.csv')

In [None]:
start_time = 1914
val_columns = ['d_%d' % (x) for x in range(start_time, start_time+28)]
valid_fold_df = train_df.loc[:, val_columns].copy()
e = WRMSSEEvaluator(train_df, valid_fold_df, calendar, prices)

Test with the ground truth. The error should be 0. 

In [None]:
valid_gt_df = train_df.set_index('id').loc[:, val_columns].copy()
e.score(valid_gt_df, 1886)

Since the task was to predict each item per day, I assumed that a difference of 10% is acceptable. Moreover, there is a random factor: many items are not sold on many days, so whether a given item is sold on a given day is partly random. For an item that is sold 10 times per day, a 10% discrepancy means that it is sold between 9 and 11 times. It is in the nature of the data that a perfect prediction is impossible, and I assumed that 10% is acceptable. Therefore, I calculate the score obtained by increasing and decreasing the number of sold items by 10%.

In [None]:
e.score(valid_gt_df*1.1, 1886), e.score(valid_gt_df*0.9, 1886)

Here, we can see that a discrepancy of around 10% per item results in a score of around 0.54. I take this as a good baseline score to target. Targeting a lower score carries the risk of overfitting or of simply being lucky on the public leaderboard, while the private leaderboard might look completely different. Therefore, I targeted an error of around 0.5. 

After the final evaluation, the best score on the private leaderboard was 0.52043, which shows that my assumption was correct. 
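
To make the effect of a 10% perturbation more concrete, here is a minimal sketch of the RMSSE for a single series. It follows the same scaling idea as `get_scale` above (mean squared one-step difference of the training history, ignoring leading zeros), but it does not include the weighting and aggregation over the 12 levels that the full WRMSSE uses. The helper name `rmsse` and the toy numbers are assumptions made for illustration, not M5 data.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Root mean squared scaled error for one series (illustrative helper)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Drop leading zeros, then scale by the mean squared one-step difference.
    train = train[np.argmax(train != 0):]
    scale = np.mean(np.diff(train) ** 2)
    return np.sqrt(np.mean((actual - forecast) ** 2) / scale)

# Hypothetical toy series (not taken from the competition data).
train = np.array([0, 0, 8, 12, 9, 11, 10, 13, 7, 10])
actual = np.array([10, 9, 12, 11])

perfect = rmsse(train, actual, actual)      # ground truth as forecast
over = rmsse(train, actual, actual * 1.1)   # every value 10% too high
under = rmsse(train, actual, actual * 0.9)  # every value 10% too low
print(perfect, over, under)
```

Scaling up and scaling down by the same percentage give the same per-series error, because the absolute deviations are identical. How large that error is depends on how volatile the training history of the series is.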

# Read Data

Adapted from kkiller

In [None]:
CAL_DTYPES={"event_name_1": "category", "event_name_2": "category", "event_type_1": "category", 
         "event_type_2": "category", "weekday": "category", 'wm_yr_wk': 'int16', "wday": "int16",
        "month": "int16", "year": "int16", "snap_CA": "float32", 'snap_TX': 'float32', 'snap_WI': 'float32' }
PRICE_DTYPES = {"store_id": "category", "item_id": "category", "wm_yr_wk": "int16","sell_price":"float32" }

def create_dt(is_train = True, nrows = None, first_day = 0, tr_last=1941, max_lags=57, skip_prices=False):
    prices = pd.read_csv("../input/m5-forecasting-accuracy/sell_prices.csv", dtype = PRICE_DTYPES)
    for col, col_dtype in PRICE_DTYPES.items():
        if col_dtype == "category":
            prices[col] = prices[col].cat.codes.astype("int16")
            prices[col] -= prices[col].min()

    cal = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv", dtype = CAL_DTYPES)
    cal["date"] = pd.to_datetime(cal["date"])
    for col, col_dtype in CAL_DTYPES.items():
        if col_dtype == "category":
            cal[col] = cal[col].cat.codes.astype("int16")
            cal[col] -= cal[col].min()

    start_day = max(1 if is_train  else tr_last-max_lags, first_day)
    numcols = [f"d_{day}" for day in range(start_day,tr_last+1)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dtype = {numcol:"float32" for numcol in numcols} 
    dtype.update({col: "category" for col in catcols if col != "id"})
    dt = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv", 
                     nrows = nrows, usecols = catcols + numcols, dtype = dtype)

    for d in range(1942, 1970): 
        dt['d_%d'%d] = np.nan
    
    for col in catcols:
        if col != "id":
            dt[col] = dt[col].cat.codes.astype("int16")
            dt[col] -= dt[col].min()

    if not is_train:
        for day in range(tr_last+1, tr_last+ 28 +1):
            dt[f"d_{day}"] = np.nan

    dt = pd.melt(dt,
                  id_vars = catcols,
                  value_vars = [col for col in dt.columns if col.startswith("d_")],
                  var_name = "d",
                  value_name = "sales")

    dt = dt.merge(cal, on= "d", copy = False)
    if not skip_prices: 
        dt = dt.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"], copy = False)

    return dt

# Data separation 
I separate the data into high-level data (i.e. the total number per store and department) and low-level data (i.e. per item). 

I assume that predicting the number of sold items per store and department is easier and more stable than predicting each item. Furthermore, the weights of the final evaluation show that getting the predictions right at these higher aggregation levels has the highest influence: if the predictions are right there, about 50% of the weighted score is correct. Last but not least, focusing on the higher-level data also has the advantage that it requires less memory. 
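
As a rough illustration of this aggregation step, the sketch below sums a tiny wide-format table over store and department. The toy frame and its values are hypothetical; in the real data, the 30490 item-level series collapse to 10 stores × 7 departments = 70 series.

```python
import pandas as pd

# Hypothetical toy frame in the wide M5 layout (one row per item and store).
sales = pd.DataFrame({
    "item_id":  ["A", "B", "C", "D"],
    "dept_id":  ["FOODS_1", "FOODS_1", "HOBBIES_1", "HOBBIES_1"],
    "store_id": ["CA_1", "CA_1", "CA_1", "CA_1"],
    "d_1": [0, 3, 1, 0],
    "d_2": [2, 4, 0, 0],
})

# Sum the daily columns within each department/store combination.
day_cols = [c for c in sales.columns if c.startswith("d_")]
high_level = sales.groupby(["dept_id", "store_id"])[day_cols].sum()
print(high_level)
```

The aggregated series contain far fewer zeros than the item-level series, which is one reason they are easier to model.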

In [None]:
def get_highlevel_data():
    cal = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv", dtype = CAL_DTYPES)
    cal["date"] = pd.to_datetime(cal["date"])
    for col, col_dtype in CAL_DTYPES.items():
        if col_dtype == "category":
            cal[col] = cal[col].cat.codes.astype("int16")
            cal[col] -= cal[col].min()
            

    numcols = [f"d_{day}" for day in range(1,1969)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dtype = {numcol:"float32" for numcol in numcols} 
    dtype.update({col: "category" for col in catcols if col != "id"})
    dt = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv", dtype = dtype)
    for col in catcols:
        if col != "id":
            dt[col] = dt[col].cat.codes.astype("int16")
            dt[col] -= dt[col].min()
    
    for d in range(1942, 1970): 
        dt['d_%d'%d] = np.nan
        
    dt1 = dt.groupby(['dept_id', 'store_id']).sum()
    dt1.drop(columns=['cat_id','state_id', 'item_id'], inplace=True)
    dt1 = dt1.join(dt[['dept_id', 'store_id', 'cat_id', 'state_id']].groupby(['dept_id', 'store_id']).mean())
    dt = pd.melt(dt1.reset_index(),
              id_vars = ['dept_id', 'store_id', 'cat_id', 'state_id'],
              value_vars = [col for col in dt.columns if col.startswith("d_")],
              var_name = "d",
              value_name = "sales")
    dt = dt.merge(cal, on= "d", copy = False) # , how='outer')
    return dt

def get_highestlevel_data():
    raise NotImplementedError
    cal = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv", dtype = CAL_DTYPES)
    cal["date"] = pd.to_datetime(cal["date"])
    for col, col_dtype in CAL_DTYPES.items():
        if col_dtype == "category":
            cal[col] = cal[col].cat.codes.astype("int16")
            cal[col] -= cal[col].min()
    numcols = [f"d_{day}" for day in range(1,1969)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dtype = {numcol:"float32" for numcol in numcols} 
    dtype.update({col: "category" for col in catcols if col != "id"})
    dt = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv", dtype = dtype)
    for col in catcols:
        if col != "id":
            dt[col] = dt[col].cat.codes.astype("int16")
            dt[col] -= dt[col].min()
    dt = dt.groupby(['cat_id', 'state_id']).sum()
    dt = pd.melt(dt.reset_index(),
              id_vars = ['cat_id', 'state_id'],
              value_vars = [col for col in dt.columns if col.startswith("d_")],
              var_name = "d",
              value_name = "sales")
    dt = dt.merge(cal, on= "d", copy = False) # , how='outer')
    return dt

def get_toplevel_data():
    raise NotImplementedError
    cal = pd.read_csv("../input/m5-forecasting-accuracy/calendar.csv", dtype = CAL_DTYPES)
    cal["date"] = pd.to_datetime(cal["date"])
    for col, col_dtype in CAL_DTYPES.items():
        if col_dtype == "category":
            cal[col] = cal[col].cat.codes.astype("int16")
            cal[col] -= cal[col].min()
    numcols = [f"d_{day}" for day in range(1,1969)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dtype = {numcol:"float32" for numcol in numcols} 
    dtype.update({col: "category" for col in catcols if col != "id"})
    dt = pd.read_csv("../input/m5-forecasting-accuracy/sales_train_evaluation.csv", dtype = dtype)
    for col in catcols:
        if col != "id":
            dt[col] = dt[col].cat.codes.astype("int16")
            dt[col] -= dt[col].min()
    dt = dt.sum().to_frame()
    dt.drop(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], inplace=True)
    dt = dt.reset_index()
    dt.columns = ['d', 'sales']
    dt = dt.merge(cal, on= "d" , copy = False) # , how='outer')
    return dt


# Features
I mainly use two kinds of features:

1. For each item, the model gets the information how often it was sold on each of the last N days. 
2. The dataset also provides information about events. However, an event might influence not only the day of the event but also the previous or following days. Therefore, I also extend each event to the N previous and following days. 
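
Both feature types can be built with a grouped `shift`. The following is a minimal sketch on a hypothetical toy frame (the column names `day`, `event`, `event_prev` and `event_next` are made up for this example): a positive shift brings past values forward, a negative shift pulls future values back, and grouping keeps the values of one store from leaking into another.

```python
import pandas as pd

# Hypothetical toy data: two stores, five days each, one event on day 3.
toy = pd.DataFrame({
    "store_id": ["S1"] * 5 + ["S2"] * 5,
    "day": list(range(1, 6)) * 2,
    "sales": [5, 6, 7, 8, 9, 50, 60, 70, 80, 90],
    "event": [0, 0, 1, 0, 0] * 2,
})

# Lag feature: sales of the previous day within the same store.
toy["lag1"] = toy.groupby("store_id")["sales"].shift(1)
# Event window: was there an event yesterday / will there be one tomorrow?
toy["event_prev"] = toy.groupby("store_id")["event"].shift(1).fillna(0)
toy["event_next"] = toy.groupby("store_id")["event"].shift(-1).fillna(0)
print(toy)
```

The first day of each store has no previous day, so its lag is missing. Rows with missing lags are removed by the `dropna()` call in the training function.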

In [None]:

def get_lag_features(dt, lag_length=29): 
    groupfeatures = ['dept_id', 'state_id', 'store_id', 'cat_id']
    groupfeatures = [x for x in groupfeatures if x in dt.columns]
    for lag in range(1, lag_length+1):
        try: 
            dt['lag%d' % (lag)] = dt.groupby(groupfeatures)['sales'].shift(lag)
        except ValueError: 
            dt['lag%d' % (lag)] = dt['sales'].shift(lag)
    return dt

def get_lag_events(dt, lag_length=3): 
    event_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    for event_feature in event_features: 
        for lag in range(-lag_length, lag_length): 
            if lag == 0: 
                continue
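            col = '%s_%d' % (event_feature, lag)
            # Shift within each department/store group; missing neighbours become code 0 ("no event").
            # Assigning the result avoids a chained in-place fillna, which newer pandas versions may ignore.
            dt[col] = dt.groupby(['dept_id', 'store_id'])[event_feature].shift(lag).fillna(0)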
    return dt     

# Training 
Training a LightGBM model on the high-level data

In [None]:
def train_lgb(df, start_valid=1886, end_valid=1914, 
              objective=None, learning_rate=0.075, boosting_type=None, num_iterations=1200): 
    if objective is None: 
        objective = "poisson"
    if boosting_type is None: 
        boosting_type = 'gbdt'
    df = df.dropna()
    cat_feats_all = ['dept_id', 'store_id', 'cat_id'] + ["event_name_1", "event_name_2", "event_type_1", "event_type_2"]
    cat_feats = [x for x in cat_feats_all if x in df.columns]
    useless_cols = ["id", "date", "sales","d", "wm_yr_wk", "weekday"]
    train_cols = df.columns[~df.columns.isin(useless_cols)]
    X_train = df[train_cols]
    y_train = df["sales"]
    np.random.seed(777)
    fake_valid_inds = df[df.d.isin(['d_%d'%x for x in range(start_valid, end_valid)])].index # np.random.choice(X_train.index.values, 1318, replace = False)
    nonused_inds = df[df.d.isin(['d_%d'%x for x in range(end_valid, 2000)])].index # np.random.choice(X_train.index.values, 1318, replace = False)

    train_inds = np.setdiff1d(X_train.index.values, fake_valid_inds)
    train_inds = np.setdiff1d(train_inds, nonused_inds)
    train_data = lgb.Dataset(X_train.loc[train_inds] , label = y_train.loc[train_inds], 
                             categorical_feature=cat_feats, free_raw_data=False)
    fake_valid_data = lgb.Dataset(X_train.loc[fake_valid_inds], label = y_train.loc[fake_valid_inds],
                                  categorical_feature=cat_feats,
                     free_raw_data=False)
    params = {
        "objective": objective,
        "metric": ["rmse"],
        "force_row_wise": True,
        "learning_rate": learning_rate,
        "boosting_type": boosting_type,
        "sub_row": 0.75,
        "bagging_freq": 1,
        "lambda_l2": 0.1,
        "verbosity": 1,
        "num_iterations": num_iterations,
        "num_leaves": 128,
        "min_data_in_leaf": 100,
    }
    m_lgb = lgb.train(params, train_data, valid_sets = [fake_valid_data], verbose_eval=20) 
    return m_lgb, train_cols

# Prediction of high level data

In [None]:
def predict(m_lgb, train_cols, df, valid_start_date=1886, pred_length=28, lag_length=29, feature_fcts=None): 
    if feature_fcts is None: 
        feature_fcts = []
    # Copy the slice so that the assignments below do not write into a view of df
    df_valid = df[df.d.isin(['d_%d'%x for x in range(valid_start_date-lag_length, valid_start_date+pred_length)])].copy()
    df_valid.loc[df_valid.d.isin(['d_%d'%x for x in range(valid_start_date, valid_start_date+pred_length)]), 'sales'] = np.nan
    for idx in range(pred_length):
        df_valid = df_valid.drop(df_valid[df_valid.d == 'd_%d' % (valid_start_date-lag_length-1+idx)].index)     
        for feature_fct in feature_fcts: 
            df_valid = feature_fct(df_valid)
        df_valid.loc[df_valid[train_cols].dropna().index, 'sales'] = m_lgb.predict(df_valid[train_cols].dropna())
    return df_valid

# Prediction of low level data

Using the relative number of sold items in the previous period. 

In total, this means that the LightGBM model predicts how many items will be sold per store and department. Furthermore, I also know how often a specific item was sold relative to the total number of sold items. With both pieces of information, I predict how often a specific item will be sold. 
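
A minimal sketch of this allocation for a single store and department is shown below. The item names, the historical counts and the group forecast of 200 are hypothetical. As a simplification, the share here is computed from the totals of the window, whereas the functions in this notebook average the daily shares.

```python
import pandas as pd

# Hypothetical history of three items in one store/department over the last window.
history = pd.DataFrame({
    "id": ["item_a", "item_b", "item_c"],
    "sold_last_window": [10.0, 30.0, 60.0],
})

# Share of each item in the group total.
history["share"] = history["sold_last_window"] / history["sold_last_window"].sum()

# Assumed high-level forecast for the group on one day.
group_forecast = 200.0
history["item_forecast"] = history["share"] * group_forecast
print(history)
```

By construction the item forecasts add up to the group forecast, so the accuracy at the store and department level carries over unchanged to the aggregated item predictions.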

In [None]:
def create_rel_prediction_dt(df_valid, start_time=1886, pred_time=28): 
    endtime = min(start_time+27, 1941)
    dt = create_dt(True, None, start_time, endtime, skip_prices=True)
    dt.loc[:, 'sales'] = np.nan
    # Select the sales column before aggregating so that non-numeric columns are not summed
    dt = dt.join(
        df_valid.groupby(['dept_id', 'store_id', 'd'])['sales'].sum(),
        on=['dept_id', 'store_id', 'd'], rsuffix='_r'
    )
    relpred = create_dt(True, None, start_time-1-pred_time, start_time-1, skip_prices=True)
    relpred = relpred.join(
        relpred.groupby(['dept_id', 'store_id', 'd'])['sales'].sum(),
        on=['dept_id', 'store_id', 'd'], rsuffix='_sum'
    )
    relpred['sales_rel'] = relpred['sales'] / relpred['sales_sum']
    dt = dt.join(relpred.groupby('id')['sales_rel'].mean(), on='id')
    dt['preds'] = dt['sales_r'] * dt['sales_rel']
    dt = dt[['id', 'd', 'preds']].copy()
    dt = dt.pivot(columns='d', index='id', values='preds')
    return dt 

In [None]:
def create_rel_week_prediction_dt(df_valid, start_time=1886, pred_time=28): 
    endtime = min(start_time+27, 1941)
    dt = create_dt(True, None, start_time, endtime, skip_prices=True)
    dt.loc[:, 'sales'] = np.nan
    # Select the sales column before aggregating so that non-numeric columns are not summed
    dt = dt.join(
        df_valid.groupby(['dept_id', 'store_id', 'd'])['sales'].sum(),
        on=['dept_id', 'store_id', 'd'], rsuffix='_r'
    )
    relpred = create_dt(True, None, start_time-1-pred_time, start_time-1, skip_prices=True)
    relpred = relpred.join(
        relpred.groupby(['dept_id', 'store_id', 'd'])['sales'].sum(),
        on=['dept_id', 'store_id', 'd'], rsuffix='_sum'
    )
    relpred['sales_rel'] = relpred['sales'] / relpred['sales_sum']
    # Average the shares per item and weekday
    dt = dt.join(relpred.groupby(['id', 'wday'])['sales_rel'].mean(), on=['id', 'wday'])
    dt['preds'] = dt['sales_r'] * dt['sales_rel']
    dt = dt[['id', 'd', 'preds']].copy()
    dt = dt.pivot(columns='d', index='id', values='preds')
    return dt 

In [None]:
def calc_kpi_high_level(df_valid, start_time=1886): 
    dt = create_dt(True, None, start_time, start_time+27, skip_prices=True)
    dt.loc[:, 'sales'] = np.nan
    groupbyids = ['dept_id', 'store_id', 'state_id']
    groupbyids = [x for x in groupbyids if x in df_valid.columns]
    groupbyids.append('d')
    dt = dt.join(
        df_valid.groupby(groupbyids)['sales'].sum(),
        on=groupbyids, rsuffix='_r'
    )
    # Number of items per group, used to spread the group total evenly over its items
    dt = dt.join(
        dt.groupby(groupbyids)['sales_r'].count(),
        on=groupbyids, rsuffix='_r'
    )
    dt['preds'] = dt.sales_r / dt.sales_r_r
    dt = dt[['id', 'd', 'preds']].copy()
    dt = dt.pivot(columns='d', index='id', values='preds')
    train_df = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_evaluation.csv')
    calendar = pd.read_csv('../input/m5-forecasting-accuracy/calendar.csv')
    prices = pd.read_csv('../input/m5-forecasting-accuracy/sell_prices.csv')
    val_columns = ['d_%d' % (x) for x in range(start_time, start_time+28)]
    valid_fold_df = train_df.loc[:, val_columns].copy()
    e = WRMSSEEvaluator(train_df, valid_fold_df, calendar, prices)
    return [e.score(dt[[f'd_{x}' for x in range(start_time, start_time+28)]], start_time-28), 
            e.score(dt[[f'd_{x}' for x in range(start_time, start_time+28)]], 1886), 
            e.score(dt[[f'd_{x}' for x in range(start_time, start_time+28)]], 1914)]



In [None]:
def calc_kpi_complete(dt, start_time=1886): 
    train_df = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_evaluation.csv')
    calendar = pd.read_csv('../input/m5-forecasting-accuracy/calendar.csv')
    prices = pd.read_csv('../input/m5-forecasting-accuracy/sell_prices.csv')
    val_columns = ['d_%d' % (x) for x in range(start_time, start_time+28)]
    valid_fold_df = train_df.loc[:, val_columns].copy()
    valid_fold_df = valid_fold_df.reset_index()
    e = WRMSSEEvaluator(train_df, valid_fold_df, calendar, prices)
    return [e.score(dt, start_time-28), e.score(dt, 1886), e.score(dt, 1914)]

In [None]:
dt = get_highlevel_data()
dt = get_lag_features(dt)
dt = get_lag_events(dt)
m_lgb, traincols = train_lgb(dt, start_valid=1914, end_valid=1941)

In [None]:
high_level_predictions_validation = predict(m_lgb, traincols, dt, feature_fcts=[get_lag_features], valid_start_date=1914)

In [None]:
calc_kpi_high_level(high_level_predictions_validation, start_time=1914)

In [None]:
low_level_predictions_validation = create_rel_prediction_dt(high_level_predictions_validation, start_time=1914, pred_time=28)

In [None]:
calc_kpi_complete(low_level_predictions_validation[[f'd_{x}' for x in range(start_time, start_time+28)]], start_time=1914)

With this submission, the score on the public leaderboard is 0.58110, which fits my calculation very well. 

# Tree-based classification 
I use LightGBM, which is a tree-based model. Similar to k-nearest-neighbour (KNN) methods, a tree-based model looks for the best-matching training samples, and the prediction is derived from the values of these training samples. Therefore, tree-based (and KNN) models are not well suited to predicting trends. Thus, the question is: do we have a trend here? 
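
The limitation can be reproduced with a very small regression tree written in plain NumPy. This is only a sketch under simplifying assumptions (one feature, fixed depth, hypothetical helper names `fit_tree` and `predict_tree`), not LightGBM itself, but it uses the same principle: each leaf predicts the mean of the training targets that fall into it.

```python
import numpy as np

def fit_tree(x, y, depth=3):
    """Fit a tiny 1-D regression tree and return it as a nested dict."""
    values = np.unique(x)
    if depth == 0 or len(values) < 2:
        return {"value": float(y.mean())}
    best_sse, best_thr = None, None
    # Candidate thresholds are the midpoints between neighbouring x values.
    for thr in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= thr], y[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best_sse is None or sse < best_sse:
            best_sse, best_thr = sse, thr
    mask = x <= best_thr
    return {"thr": best_thr,
            "left": fit_tree(x[mask], y[mask], depth - 1),
            "right": fit_tree(x[~mask], y[~mask], depth - 1)}

def predict_tree(node, x0):
    """Walk down the tree until a leaf is reached and return its mean."""
    while "value" not in node:
        node = node["left"] if x0 <= node["thr"] else node["right"]
    return node["value"]

# Perfectly linear trend: y doubles the time index.
t_train = np.arange(100, dtype=float)
y_train = 2.0 * t_train
tree = fit_tree(t_train, y_train, depth=4)

future_pred = predict_tree(tree, 127.0)  # 28 steps after the last training point
print(future_pred, "true value would be", 2.0 * 127.0)
```

The prediction for a future time step equals the value of the last leaf and can never exceed the largest target seen in training, so a continuing trend is not followed. How much this matters depends on how strong the trend is over the forecast horizon.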

In [None]:
df_vis = get_highlevel_data()

In [None]:
df_vis.groupby('date').sales.sum().plot()
plt.show()

It seems that we have a long-term trend here. 

In [None]:
df_vis[df_vis.date.dt.year >= 2015].groupby('date').sales.sum().plot()
plt.show() 
df_vis[df_vis.date.dt.year >= 2016].groupby('date').sales.sum().plot()
plt.show()

However, we need to keep in mind that we only need to predict the next 28 days. If we look only at the data from 2015 and 2016 (or even only at 2016), I do not see a strong trend. There might be a trend, but I think it is not very strong over the next 28 days, so using a tree-based model is acceptable. 

# Create submission

In [None]:
high_level_predictions_evaluation = predict(m_lgb, traincols, dt, feature_fcts=[get_lag_features], valid_start_date=1942)

In [None]:
low_level_predictions_evaluation = create_rel_prediction_dt(high_level_predictions_evaluation, start_time=1942, pred_time=28)

In [None]:
low_level_predictions_validation = low_level_predictions_validation[["d_%d" % (x) for x in range(1914, 1942)]]

In [None]:
low_level_predictions_validation = low_level_predictions_validation.rename(columns={"d_%d"%(x+1914): "F%d"%(x+1) for x in range(28)})

In [None]:
low_level_predictions_validation = low_level_predictions_validation.reset_index()

In [None]:
# regex=True is needed so that "$" anchors the match at the end of the id
low_level_predictions_validation["id"] = low_level_predictions_validation["id"].str.replace("evaluation$", "validation", regex=True)

In [None]:
low_level_predictions_evaluation = low_level_predictions_evaluation[["d_%d" % (x) for x in range(1942, 1970)]]
low_level_predictions_evaluation = low_level_predictions_evaluation.rename(columns={"d_%d"%(x+1942): "F%d"%(x+1) for x in range(28)})
low_level_predictions_evaluation = low_level_predictions_evaluation.reset_index()

In [None]:
submission = pd.concat([low_level_predictions_validation, low_level_predictions_evaluation])

In [None]:
submission.to_csv('submission.csv', index=False)

Because this was a late submission, the scores are: 

0.62260 (private) and 0.58110 (public). 

So this is consistent with my previous submission.