# About this notebook
- Sales prediction here is a time series problem, but we solve it with machine learning algorithms.
- Intuitively, lag features should therefore improve performance, because they let the model see past sales.
- Previous notebook: https://www.kaggle.com/kaiweihuang/m5-forecasting-accuracy-sales-basic-features

In [None]:
# Import libraries and set paths
import os
import time
import warnings
import numpy as np
import pandas as pd

VERSION = 1
INPUT_PATH = "/kaggle/input/m5-forecasting-accuracy-sales-basic-features"
BASE_PATH = f"/kaggle/working/m5-forecasting-accuracy-ver{VERSION}"

In [None]:
# Turn off warnings

warnings.filterwarnings("ignore")

In [None]:
# Change directory

os.chdir(INPUT_PATH)
print(f"Change to directory: {os.getcwd()}")

In [None]:
# Memory usage function

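def format_memory_usage(total_bytes):
    unit_list = ["", "Ki", "Mi", "Gi"]
    # Divide by 1024 until the value fits the current unit
    for unit in unit_list[:-1]:
        if total_bytes < 1024:
            return f"{total_bytes:.2f}{unit}B"
        total_bytes /= 1024
    # Anything larger is reported in the largest unit (GiB)
    return f"{total_bytes:.2f}{unit_list[-1]}B"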

In [None]:
# Set global variables

days_to_predict = 28

In [None]:
# Load dataset from our previous work

df_lag_features = pd.read_pickle("m5-forecasting-accuracy-ver1/sales_basic_features.pkl")
df_lag_features.head(10)

# Feature Engineering - Sales - Lag Features
- We are going to predict 28 days.
- A lag of at least 28 days ensures that every prediction row has a value for that feature.
- However, shifting every item by 28 days only is not ideal either, because the first 28 rows of every item end up without a value and the loss of training data is not small.
- So we create a lag feature for each day from 1 to 28.
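
As a minimal sketch of what these lag features look like, the toy example below shifts `sales` within each id and counts the missing values each lag introduces, which is the training-data loss mentioned above. The frame `toy`, its two item ids, and its values are made up for illustration; only the column names `id`, `d`, and `sales` come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: two items, five days each, sorted by day within each id
toy = pd.DataFrame({
    "id": ["item_a"] * 5 + ["item_b"] * 5,
    "d": list(range(1, 6)) * 2,
    "sales": [3, 0, 2, 5, 1, 7, 7, 0, 4, 6],
})

# Shift sales within each id so a lag never picks up values from another item
for lag in (1, 2):
    toy[f"sales_lag_{lag}"] = toy.groupby("id")["sales"].shift(lag)

# A lag of k days leaves the first k rows of every id without a value
nan_per_lag = toy[["sales_lag_1", "sales_lag_2"]].isna().sum()

print(toy)
print(nan_per_lag)
```

With 28 lags, the same pattern means the first 28 rows of every item have at least one missing lag value.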

In [None]:
# Get necessary columns only

df_lag_features = df_lag_features[["id", "d", "sales"]]
df_lag_features.head(10)

In [None]:
# Create features
# Generate basic lag features and control the memory usage

df_lag_grouped = df_lag_features.groupby(["id"])["sales"]

for lag in range(1, days_to_predict + 1):

    start_time = time.time()
    print(f"Day {lag} Start.")

    # Shift sales within each id and store as float16 to limit memory usage
    lag_column = df_lag_grouped.shift(lag).astype(np.float16)
    df_lag_features = df_lag_features.assign(**{f"sales_lag_{lag}": lag_column})

    end_time = time.time()
    print(f"Calculation time: {round(end_time - start_time)} seconds")

In [None]:
# Check dataset

df_lag_features.head(30)

# Note
- There are many ways to deal with the "NaN" values that the lag features introduce.
- Dropping those rows is not recommended because we would lose a lot of important information.
- Since we plan to train with LightGBM, which handles missing values natively, it is fine to leave the "NaN" values in place.
- In addition, because "groupby" preserves the order of rows within each group and the shifted values keep the original index, we don't need to sort the DataFrame again. It can be joined directly with the original features on "id" and "d".
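
The join on "id" and "d" could look like the following sketch. The frames `basic` and `lags` and the column `sell_price` are hypothetical stand-ins, not the actual feature tables of this notebook; the sketch only assumes that both tables share the `id` and `d` keys and that each key pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original feature table
basic = pd.DataFrame({
    "id": ["item_a", "item_a", "item_b", "item_b"],
    "d": [1, 2, 1, 2],
    "sell_price": [1.5, 1.5, 3.0, 2.5],
})

# Hypothetical stand-in for the lag feature table; NaN marks a missing lag
lags = pd.DataFrame({
    "id": ["item_a", "item_a", "item_b", "item_b"],
    "d": [1, 2, 1, 2],
    "sales_lag_1": [np.nan, 3.0, np.nan, 7.0],
})

# Left join on the shared keys; validate raises if a key pair is duplicated
merged = basic.merge(lags, on=["id", "d"], how="left", validate="one_to_one")

print(merged)
```

The "NaN" values pass through the join unchanged, so they can be left for the model to handle.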

In [None]:
# Check current memory usage

memory_usage_string = format_memory_usage(df_lag_features.memory_usage().sum())
print(f"Current memory usage: {memory_usage_string}")

In [None]:
# Check data type

df_lag_features.info()

In [None]:
# Change to output path

# Create the output directory if it does not exist yet, then switch to it
os.makedirs(BASE_PATH, exist_ok=True)
os.chdir(BASE_PATH)
print(f"Change to directory: {os.getcwd()}")

In [None]:
# Save pickle file

df_lag_features.to_pickle("sales_lag_features.pkl")