# Ensemble time series prediction

This notebook uses the Predict Future Sales competition data. Time series analysis is used to extract patterns from the data, and time series features are built from them to predict future sales. <br>
For prediction, a nonlinear model (LightGBM) and linear regression models (Ridge, Lasso and ElasticNet) are combined into an ensemble.
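
As a rough illustration of the ensemble idea, the sketch below blends predictions from several models with a weighted average. The prediction arrays, the equal weights and the `blend_predictions` helper are hypothetical placeholders, not the models or the blending scheme trained in this notebook.

```python
import numpy as np

def blend_predictions(predictions, weights=None):
    """Weighted average of per-model prediction arrays (hypothetical helper)."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    if weights is None:
        weights = np.ones(len(predictions))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return weights @ stacked

# Made-up predictions standing in for LightGBM, Ridge, Lasso and ElasticNet outputs
pred_lgbm = np.array([1.0, 2.0, 3.0])
pred_ridge = np.array([1.2, 1.8, 3.2])
pred_lasso = np.array([0.8, 2.2, 2.8])
pred_enet = np.array([1.0, 2.0, 3.0])

blended = blend_predictions([pred_lgbm, pred_ridge, pred_lasso, pred_enet])
print(blended)
```

Passing explicit `weights` lets one model count more than the others; equal weights reduce to a plain mean.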

### Libraries

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Libraries
import datetime

# Visualization
from matplotlib import pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns

import statsmodels.api as sm

# Statistics library
from scipy.stats import norm
from scipy import stats
import scipy

# Data preprocessing
from sklearn.model_selection import train_test_split

# Machine learning
import lightgbm as lgb
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Validation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

### Data loading and checking

In [None]:
df_items = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/items.csv", header=0)
df_shops = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/shops.csv", header=0)
df_sales_train = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv", header=0)
df_test = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/test.csv", header=0)
df_category = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv", header=0)
sample = pd.read_csv("/kaggle/input/competitive-data-science-predict-future-sales/sample_submission.csv")

In [None]:
df_items.head()

In [None]:
df_items.shape

In [None]:
df_shops.head()

In [None]:
df_shops.shape

In [None]:
df_sales_train.head()

In [None]:
df_sales_train.shape

In [None]:
df_test.head()

In [None]:
df_test.shape

In [None]:
df_category.head()

In [None]:
df_category.shape

In [None]:
sample.head()

In [None]:
sample.shape

Null data

In [None]:
print("Null data df_sales_train:{}".format(df_sales_train.isnull().sum().sum()))
print("Null data df_items:{}".format(df_items.isnull().sum().sum()))
print("Null data df_shops:{}".format(df_shops.isnull().sum().sum()))
print("Null data df_category:{}".format(df_category.isnull().sum().sum()))

The provided data contains no null values.

### Data preprocessing
First, data processing was performed for time series analysis.

In [None]:
### Data preprocessing, df_train, datetime
df_sales_train["date_dt"] = pd.to_datetime(df_sales_train["date"], format='%d.%m.%Y')

df_sales_train["year"] = df_sales_train["date_dt"].dt.year
df_sales_train["month"] = df_sales_train["date_dt"].dt.month
df_sales_train["day"] = df_sales_train["date_dt"].dt.day

In [None]:
df_sales_train["item_sales"] = df_sales_train["item_price"]*df_sales_train["item_cnt_day"]
df_sales_train = pd.merge(df_sales_train, df_items[["item_id", "item_category_id"]], left_on="item_id", right_on="item_id", how="left")
train_df = df_sales_train.drop("date", axis=1).sort_index()

## EDA
## Time series of daily sales

### Total daily sales

In [None]:
tot_daily_sales = train_df.groupby("date_dt").sum()["item_sales"]

In [None]:
# Time series
fig, ax1 = plt.subplots(figsize=(20,6))
ax1.plot(tot_daily_sales.index, tot_daily_sales/1000, linewidth=1)
ax1.set_ylabel("Sales(k)")
ax2 = ax1.twinx()
ax2.plot(tot_daily_sales.index, tot_daily_sales.cumsum()/1000000, linewidth=1, color="red")
ax2.grid()
ax2.set_ylabel("Total Sales(M)")
plt.xlabel("time")

Plotted in chronological order, daily sales grow strongly in January, June and December. From December to January it is not only individual days that spike: the baseline level also rises, which indicates a period of strong purchasing over the whole month. By contrast, the baseline does not rise at the beginning of June, so that peak may be a short-term, campaign-like event.<br>

The month to be forecast falls where sales start to rise, so capturing the seasonality is likely more important than capturing the immediately preceding values.

In [None]:
# item sales distribution
plt.figure(figsize=(10,6))
# histplot replaces the deprecated distplot (kde is off by default)
sns.histplot(tot_daily_sales, bins=50)
plt.ylabel("Frequency")
plt.yscale("log")
plt.xlabel("Total daily sales")

Looking at the distribution of daily sales, most days have relatively low sales. Occasionally a day records a spike in sales, but such days act as noise for forecasting. Because they are likely to harm the models trained later, we decided to exclude them from the training data as outliers.

To do so, outliers are identified in price and in the number of items sold, the two quantities that determine sales, and the distributions with and without the outliers are compared.
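
To make the effect of such cut-offs concrete, the sketch below counts how many rows a pair of upper limits removes from a small made-up table. The toy values and the `filter_outliers` helper are illustrative assumptions; the limits of 800 and 70000 are the ones applied to `train_df` in this notebook.

```python
import pandas as pd

def filter_outliers(df, max_count, max_price):
    """Keep rows whose daily count and price are below the limits (hypothetical helper)."""
    mask = (df["item_cnt_day"] < max_count) & (df["item_price"] < max_price)
    return df[mask]

# Toy data, not the competition data
toy = pd.DataFrame({
    "item_cnt_day": [1, 2, 1000, 3, 1],
    "item_price": [500.0, 300000.0, 100.0, 1500.0, 999.0],
})

filtered = filter_outliers(toy, max_count=800, max_price=70000)
removed = len(toy) - len(filtered)
print("rows removed:", removed)
```

Checking the number of removed rows before filtering the real data is a cheap way to confirm that a threshold only trims the extreme tail.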

In [None]:
fig, ax = plt.subplots(2,2,figsize=(20,12))
plt.subplots_adjust(hspace=0.5)
# Pass each series as x explicitly so the boxes are drawn horizontally
sns.boxplot(x=train_df["item_cnt_day"], ax=ax[0,0])
ax[0,0].set_title("item_cnt_day")

sns.boxplot(x=train_df[train_df["item_cnt_day"]<800]["item_cnt_day"], ax=ax[0,1])
ax[0,1].set_title("item_cnt_day Remove outlier")

sns.boxplot(x=train_df["item_price"], ax=ax[1,0])
ax[1,0].set_title("item_price")

sns.boxplot(x=train_df[train_df["item_price"]<70000]["item_price"], ax=ax[1,1])
ax[1,1].set_title("item_price Remove outlier")

In [None]:
# Update
train_df = train_df[train_df["item_cnt_day"]<800]
train_df = train_df[train_df["item_price"]<70000]

In [None]:
# Re-aggregate tot_daily_sales
tot_daily_sales = train_df.groupby("date_dt").sum()["item_sales"]

## Decomposition of time series components
Time series analysis was carried out on total sales, on sales split by shop, on sales split by category, and on a sample of individual items. To examine seasonality, the series were decomposed with yearly and monthly periods.
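
For readers unfamiliar with the technique, the following is a simplified sketch of a classical additive decomposition (observed = trend + seasonal + residual) on synthetic data, written with plain pandas. It illustrates the idea only and is not the exact algorithm used by `seasonal_decompose`; the `simple_additive_decompose` function and the synthetic series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def simple_additive_decompose(series, period):
    """Simplified additive decomposition (illustrative; assumes an odd period)."""
    # Trend: centered moving average over one full period
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend
    # Seasonal: average detrended value at each position within the period
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()  # center on zero
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality do not explain
    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic daily series: linear trend plus a repeating weekly pattern
idx = pd.date_range("2020-01-01", periods=70, freq="D")
weekly = np.tile([3.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0], 10)
observed = pd.Series(0.5 * np.arange(70) + weekly, index=idx)

trend, seasonal, resid = simple_additive_decompose(observed, period=7)
print(seasonal.iloc[:7].round(2).tolist())
```

Because the synthetic series contains no noise, the residual is zero wherever the trend is defined; on real sales data the residual carries the irregular spikes.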

## Total daily sales

### Annual trend, Total daily sales

In [None]:
# period = 365 days (recent statsmodels versions use `period` instead of `freq`)
res = sm.tsa.seasonal_decompose(tot_daily_sales, period=365)

# Decomposition
trend = res.trend
seaso = res.seasonal
resid = res.resid

In [None]:
# Visualization: observed series and its trend, seasonal and residual components
fig, ax = plt.subplots(4, 1, figsize=(15, 15))
plt.subplots_adjust(hspace=0.5)

components = [
    ("Time series", tot_daily_sales, "black"),
    ("Trend", trend, "red"),
    ("Seasonal", seaso, "blue"),
    ("Resid", resid, "green"),
]
for axis, (title, series, color) in zip(ax, components):
    axis.plot(series.index, series, color=color)
    axis.set_title(title)
    axis.set_ylabel("Daily_sales\n(Frequency:365day)")
    axis.set_xlabel("Time")

* Decomposition with a 365-day period was performed to check for annual periodicity. The results make the sales growth from December to January noted earlier clearer. A trend that was not apparent in the raw series also emerges: sales are on a slight downward trend compared with the previous year.

In [None]:
del trend, seaso, resid

### Monthly trend, Total daily sales

In [None]:
# period = 30 days (recent statsmodels versions use `period` instead of `freq`)
res = sm.tsa.seasonal_decompose(tot_daily_sales, period=30)

# Decomposition
trend = res.trend
seaso = res.seasonal
resid = res.resid

In [None]:
# Visualization: last 365 days of the observed series and its components
fig, ax = plt.subplots(4, 1, figsize=(15, 15))
plt.subplots_adjust(hspace=0.5)

components = [
    ("Time series", tot_daily_sales, "black"),
    ("Trend", trend, "red"),
    ("Seasonal", seaso, "blue"),
    ("Resid", resid, "green"),
]
for axis, (title, series, color) in zip(ax, components):
    recent = series.iloc[-365:]
    axis.plot(recent.index, recent, color=color)
    axis.set_title(title)
    axis.set_ylabel("Daily_sales\n(Frequency:30day)")
    axis.set_xlabel("Time")

Next, extracting the periodic component with a 30-day period also reveals growth on specific days of the month. Looking at the trend with the periodic component removed, sales in the forecast month appear likely to increase by about 30% compared with the previous month.

In [None]:
del trend, seaso, resid, tot_daily_sales

## Shop daily sales

The dataset also provides shop and category information for the sales analysis. Sales trends probably differ from shop to shop, and likely from category to category as well, so time series analysis was performed separately for each. As before, each series was decomposed into trend and periodic components so that its characteristics could be examined.

### Annual trend, shop daily sales

In [None]:
# pivot by shops
shops_pivot = pd.pivot_table(train_df, index="date_dt", columns="shop_id", values="item_sales", aggfunc="sum", fill_value=0)

# Shops sample, id=0 & 2 & 3
sample_0 = shops_pivot[0]
sample_1 = shops_pivot[2]
sample_2 = shops_pivot[3]

# period = 365 days (recent statsmodels versions use `period` instead of `freq`)
res_0 = sm.tsa.seasonal_decompose(sample_0, period=365)
res_1 = sm.tsa.seasonal_decompose(sample_1, period=365)
res_2 = sm.tsa.seasonal_decompose(sample_2, period=365)

# Decomposition
trend_0 = res_0.trend
seaso_0 = res_0.seasonal
resid_0 = res_0.resid

trend_1 = res_1.trend
seaso_1 = res_1.seasonal
resid_1 = res_1.resid

trend_2 = res_2.trend
seaso_2 = res_2.seasonal
resid_2 = res_2.resid

In [None]:
# Visualization: one column per shop, one row per component
fig, ax = plt.subplots(4, 3, figsize=(25, 15))
plt.subplots_adjust(hspace=0.5)

shop_panels = [
    ("Shop0", sample_0, trend_0, seaso_0, resid_0),
    ("Shop2", sample_1, trend_1, seaso_1, resid_1),
    ("Shop3", sample_2, trend_2, seaso_2, resid_2),
]
styles = [("Time series", "black"), ("Trend", "red"), ("Seasonal", "blue"), ("Resid", "green")]

for col, (label, *series_list) in enumerate(shop_panels):
    for row, (series, (title, color)) in enumerate(zip(series_list, styles)):
        ax[row, col].plot(series.index, series, color=color)
        ax[row, col].set_title("{}_{}".format(label, title))
        ax[row, col].set_ylabel("Daily_sales\n(Frequency:365day)", fontsize=15)
        ax[row, col].set_xlabel("Time")
        ax[row, col].tick_params(axis='x', labelsize=10)

Three shops (0, 2 and 3) were sampled. The results show that shop 0 behaves quite differently from the others: it had sales initially, but they later disappear. The shop may have closed, but the important point is that each shop behaves differently. We therefore concluded that separate features should be created for each shop and added to the prediction model.

In [None]:
del res_0, res_1, res_2, trend_0, seaso_0, resid_0, trend_1, seaso_1, resid_1, trend_2, seaso_2, resid_2

## Category daily sales

### Annual trend, Category sales

In [None]:
# pivot by category
category_pivot = pd.pivot_table(train_df, index="date_dt", columns="item_category_id", values="item_sales", aggfunc="sum", fill_value=0)

# Category sample, id=0 & 2 & 3
sample_0 = category_pivot[0]
sample_1 = category_pivot[2]
sample_2 = category_pivot[3]

# period = 365 days (recent statsmodels versions use `period` instead of `freq`)
res_0 = sm.tsa.seasonal_decompose(sample_0, period=365)
res_1 = sm.tsa.seasonal_decompose(sample_1, period=365)
res_2 = sm.tsa.seasonal_decompose(sample_2, period=365)

# Decomposition
trend_0 = res_0.trend
seaso_0 = res_0.seasonal
resid_0 = res_0.resid

trend_1 = res_1.trend
seaso_1 = res_1.seasonal
resid_1 = res_1.resid

trend_2 = res_2.trend
seaso_2 = res_2.seasonal
resid_2 = res_2.resid

In [None]:
# Visualization: one column per category, one row per component
fig, ax = plt.subplots(4, 3, figsize=(25, 15))
plt.subplots_adjust(hspace=0.5)

category_panels = [
    ("Category0", sample_0, trend_0, seaso_0, resid_0),
    ("Category2", sample_1, trend_1, seaso_1, resid_1),
    ("Category3", sample_2, trend_2, seaso_2, resid_2),
]
styles = [("Time series", "black"), ("Trend", "red"), ("Seasonal", "blue"), ("Resid", "green")]

for col, (label, *series_list) in enumerate(category_panels):
    for row, (series, (title, color)) in enumerate(zip(series_list, styles)):
        ax[row, col].plot(series.index, series, color=color)
        ax[row, col].set_title("{}_{}".format(label, title))
        ax[row, col].set_ylabel("Daily_sales\n(Frequency:365day)", fontsize=15)
        ax[row, col].set_xlabel("Time")
        ax[row, col].tick_params(axis='x', labelsize=10)

Like shops, categories show different tendencies. As before, some have sales only at the beginning and then drop to 0, while others, such as Category 2 and Category 3, have sales up to the most recent period; among those, some are on a downward trend and others are flat. Each category therefore carries its own information, which should be added to the features.

In [None]:
del res_0, res_1, res_2, trend_0, seaso_0, resid_0, trend_1, seaso_1, resid_1, trend_2, seaso_2, resid_2

## Item time series

### Time series analysis, Item sales & Item count & Item price

In [None]:
# pivot by item
item_pivot_sales = pd.pivot_table(train_df, index="date_dt", columns="item_id", values="item_sales", aggfunc="mean", fill_value=0)
item_pivot_count = pd.pivot_table(train_df, index="date_dt", columns="item_id", values="item_price", aggfunc="count", fill_value=0)
item_pivot_price = pd.pivot_table(train_df, index="date_dt", columns="item_id", values="item_price", aggfunc="mean", fill_value=0)

In [None]:
# Sample 1000, 2000, 10000
fig, ax = plt.subplots(3,3, figsize=(25,20))

item_list = [1000,2000,10000]
pivot_list = [item_pivot_sales, item_pivot_count, item_pivot_price]

for i in range(len(item_list)):
    for k in range(len(pivot_list)):
        ax[i,k].plot(pivot_list[k][item_list[i]].index, pivot_list[k][item_list[i]])
        ax[i,k].set_xlabel("Time")
        ax[i,k].tick_params(axis='x', labelsize=10)
        ax[i,0].set_ylabel("Sales")
        ax[i,1].set_ylabel("Count")
        ax[i,2].set_ylabel("Price")
        ax[i,k].set_title("item_id:{}".format(item_list[i]))

del item_pivot_sales, item_pivot_count, item_pivot_price

To conclude the time series analysis, a few items were sampled, and their daily sales, count and price were plotted. The results show that sales do not only fall over time; there are periods when they rise, and these rises are due to higher prices rather than to more units sold. To predict sales, it may therefore be important to track price as well as the number of units sold. The number of units sold may in turn be reflected in the price.
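
Since sales are the product of price and count, a change in sales can be split into a part caused by the count and a part caused by the price. The sketch below shows one such split on made-up numbers; the `split_sales_change` helper and the values are assumptions for illustration, and other attribution conventions are possible.

```python
def split_sales_change(price_before, count_before, price_after, count_after):
    """Split a change in sales (price * count) into count and price effects."""
    # Count effect: extra units valued at the old price
    count_effect = (count_after - count_before) * price_before
    # Price effect: price change applied to the new number of units
    price_effect = (price_after - price_before) * count_after
    return count_effect, price_effect

# Made-up example: count unchanged, price rises from 100 to 150
count_effect, price_effect = split_sales_change(100.0, 10, 150.0, 10)
total_change = 150.0 * 10 - 100.0 * 10
print(count_effect, price_effect, total_change)
```

The two effects always add up to the total change in sales, so a rise with a zero count effect is explained entirely by price.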

## Feature engineering

Based on the results of the time series analysis so far, we decided on the following policy for the information used to build the sales forecast model.

### Direction of features engineering
- Important information<br>
Trend : information on the current year's downward trend is required<br>
Seasonal : reflect the previous year's pattern<br>
In addition, reflect time series information for each shop, category and item, using both sales and price information.<br>

Train the model with a dataset and features that can reflect this information.

### Features
The following features were created for each shop, category and item.<br>
 Lag features : lags of 1, 2, 3 and 6 months<br>
 Trend features : use 3 points from one year ago, over periods of 1, 3 and 6 months.<br>
 Seasonal features : use 4 points from one year ago, over periods of 1, 2, 3 and 6 months.  
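
As a minimal sketch of what a lag feature looks like here, the snippet below computes the difference between the latest monthly value and the value k months earlier for a toy series. The toy numbers and the `lag_differences` helper are assumptions for illustration; the feature class in this notebook additionally scales these differences.

```python
import pandas as pd

def lag_differences(series, lags=(1, 2, 3)):
    """Difference between the latest value and the value k steps earlier (hypothetical helper)."""
    last = series.iloc[-1]
    return {"lag{}".format(k): last - series.iloc[-1 - k] for k in lags}

# Toy monthly counts for one shop, oldest first
monthly = pd.Series([10, 12, 9, 15, 14, 20])
features = lag_differences(monthly)
print(features)
```

A positive value means the latest month is higher than the month k steps back, so these features describe recent momentum rather than absolute level.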

## How to create a dataset
### Test data set
Time series up to the last month : predict the next month.

### Training data set
Use the information up to a year ago so that the forecast months match.
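
In the feature class defined in this notebook, the training features are computed from a series that ends one month earlier than the one used for the test features. The sketch below illustrates that one-step shift on a toy series. Treating the final observed month as the training label is an assumption made for this illustration, and the `make_windows` helper and the numbers are hypothetical.

```python
import pandas as pd

def make_windows(series):
    """Return (train_history, train_target, test_history) for a one-step-ahead setup."""
    train_history = series.iloc[:-1]   # everything except the final month
    train_target = series.iloc[-1]     # final observed month used as the label (assumption)
    test_history = series              # full history; the next month is unknown
    return train_history, train_target, test_history

monthly = pd.Series(
    [5, 7, 6, 9, 11],
    index=pd.period_range("2015-06", periods=5, freq="M"),
)
train_history, train_target, test_history = make_windows(monthly)
print(train_history.index[-1], train_target, test_history.index[-1])
```

Keeping the two windows one step apart means the training features never include the month they are asked to predict.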



In [None]:
# Preparing dataset
master = train_df.copy()

Define the class.<br>
It calculates monthly lag features (1, 2, 3, 6), trend lag features (1, 3, 6) and seasonal lag features (1, 2, 3, 6).
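
The trend features in this section index the trend component at `-int(seasonal_len*0.5)-1`. A likely reason is that a centered moving average cannot be computed for the last half-window of points, which are left as NaN. The sketch below reproduces that effect with a centered 2×12 moving average in NumPy; it approximates what a two-sided decomposition filter does, and the `centered_ma_even` helper and the toy data are assumptions.

```python
import numpy as np

def centered_ma_even(values, period):
    """Centered moving average for an even period; edges are NaN (illustrative)."""
    values = np.asarray(values, dtype=float)
    half = period // 2
    # Weights 0.5, 1, ..., 1, 0.5 divided by the period (a "2 x period" average)
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    out = np.full(len(values), np.nan)
    out[half:len(values) - half] = np.convolve(values, weights, mode="valid")
    return out

monthly = np.arange(34, dtype=float)      # 34 toy monthly values
trend = centered_ma_even(monthly, period=12)
last_valid = -int(12 * 0.5) - 1           # same index expression as the trend features
print(np.isnan(trend[-6:]).all(), trend[last_valid])
```

With a period of 12, the final six trend values are undefined, so the seventh value from the end is the most recent usable trend point.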

In [None]:
# Define class
class feature_eng:
    def __init__(self, data_ser, data_df, seasonal_len, name):
        self.date = data_ser
        self.data = data_df
        self.seas = seasonal_len
        self.name = name
        self.col = data_df.columns
        
    def sep_lag_trend_seaso_train(self):
        max_list = []
        mean_list = []
        lag1_list = []
        lag2_list = []
        lag3_list = []
        lag4_6_list = []
        seas1_list = []
        seas2_list = []
        seas3_list = []
        seas4_6_list = []        
        
        for i in self.col:
            # Calculate trend and seasonal factor
            res = sm.tsa.seasonal_decompose(self.data[i], period=self.seas)
            last = self.data[i].values[-2]
            max_ = self.data[i].values[-2].max()
            mean = self.data[i].values[-8:-2].mean()
            # Append to list
            max_list.append(self.data[i].values[:-2].max())
            mean_list.append(self.data[i].values[:-2].mean())
            lag1_list.append((last - self.data[i].values[-3])*1)
            lag2_list.append((last - self.data[i].values[-4])*2)
            lag3_list.append((last - self.data[i].values[-5])*3)
            lag4_6_list.append((last - (self.data[i].values[-6]+self.data[i].values[-7]+self.data[i].values[-8]))*15)
            seas1_list.append((res.seasonal.values[-self.seas-1])*1)
            seas2_list.append((res.seasonal.values[-self.seas-2])*2)
            seas3_list.append((res.seasonal.values[-self.seas-3])*3)
            seas4_6_list.append((res.seasonal.values[-self.seas-4]+res.seasonal.values[-self.seas-5]+res.seasonal.values[-self.seas-6])*15)
        # Output data frame
        out = pd.DataFrame({"id":self.col,
                            "{}_max".format(self.name):max_list,
                            "{}_mean".format(self.name):mean_list,
                            "{}_lag1".format(self.name):lag1_list,
                            "{}_lag2".format(self.name):lag2_list,
                            "{}_lag3".format(self.name):lag3_list,
                            "{}_lag4_6".format(self.name):lag4_6_list,
                            "{}_seas1".format(self.name):seas1_list,
                            "{}_seas2".format(self.name):seas2_list,
                            "{}_seas3".format(self.name):seas3_list,
                            "{}_seas4_6".format(self.name):seas4_6_list
                           })
        return out
    
    def sep_lag_trend_seaso_test(self):
        max_list = []
        mean_list = []
        lag1_list = []
        lag2_list = []
        lag3_list = []
        lag4_6_list = []
        seas1_list = []
        seas2_list = []
        seas3_list = []
        seas4_6_list = []   
        
        for i in self.col:
            # Calculate trend and seasonal factor
            res = sm.tsa.seasonal_decompose(self.data[i], period=self.seas)
            last = self.data[i].values[-1]
            max_ = self.data[i][:-1].values.max()
            mean = self.data[i].values[-19:-1].mean()
            # Append to list
            max_list.append(self.data[i].values[:-1].max())
            mean_list.append(self.data[i].values[:-1].mean())
            lag1_list.append((last - self.data[i].values[-2])*1)
            lag2_list.append((last - self.data[i].values[-3])*2)
            lag3_list.append((last - self.data[i].values[-4])*3)
            lag4_6_list.append((last - (self.data[i].values[-5]+self.data[i].values[-6]+self.data[i].values[-7]))*15)
            seas1_list.append((res.seasonal.values[-self.seas-1])*1)
            seas2_list.append((res.seasonal.values[-self.seas-2])*2)
            seas3_list.append((res.seasonal.values[-self.seas-3])*3)
            seas4_6_list.append((res.seasonal.values[-self.seas-4]+res.seasonal.values[-self.seas-5]+res.seasonal.values[-self.seas-6])*15)
        # Output data frame
        out = pd.DataFrame({"id":self.col,
                            "{}_max".format(self.name):max_list,
                            "{}_mean".format(self.name):mean_list,
                            "{}_lag1".format(self.name):lag1_list,
                            "{}_lag2".format(self.name):lag2_list,
                            "{}_lag3".format(self.name):lag3_list,
                            "{}_lag4_6".format(self.name):lag4_6_list,
                            "{}_seas1".format(self.name):seas1_list,
                            "{}_seas2".format(self.name):seas2_list,
                            "{}_seas3".format(self.name):seas3_list,
                            "{}_seas4_6".format(self.name):seas4_6_list
                           })
        return out

In [None]:
# Define class
class feature_eng:
    def __init__(self, data_ser, data_df, seasonal_len, name):
        self.date = data_ser
        self.data = data_df
        self.seas = seasonal_len
        self.name = name
        self.col = data_df.columns
        
    def sep_lag_trend_seaso_train(self):
        max_list = []
        mean_list = []
        lag1_list = []
        lag2_list = []
        lag3_list = []
        lag4_6_list = []
        tre1_list = []
        tre3_list = []
        tre4_6_list = []
        seas1_list = []
        seas2_list = []
        seas3_list = []
        seas4_6_list = []        
        
        for i in self.col:
            # Calculate trend and seasonal factor
            res = sm.tsa.seasonal_decompose(self.data[i], period=self.seas)
            last = self.data[i].values[-2]
            max_ = self.data[i].values[-2].max()
            mean = self.data[i].values[-8:-2].mean()
            # Append to list
            max_list.append(self.data[i].values[:-2].max())
            mean_list.append(self.data[i].values[:-2].mean())
            lag1_list.append((last - self.data[i].values[-3])*1)
            lag2_list.append((last - self.data[i].values[-4])*2)
            lag3_list.append((last - self.data[i].values[-5])*3)
            lag4_6_list.append((last - (self.data[i].values[-6]+self.data[i].values[-7]+self.data[i].values[-8]))*15)
            tre1_list.append((last - res.trend.values[-int(self.seas*0.5)-1])*1)
            tre3_list.append((last - res.trend.values[-int(self.seas*0.5)-3])*3)
            tre4_6_list.append((last - (res.trend.values[-int(self.seas*0.5)-4]+res.trend.values[-int(self.seas*0.5)-5]+res.trend.values[-int(self.seas*0.5)-6]))*15)
            seas1_list.append((res.seasonal.values[-self.seas-1])*1)
            seas2_list.append((res.seasonal.values[-self.seas-2])*2)
            seas3_list.append((res.seasonal.values[-self.seas-3])*3)
            seas4_6_list.append((res.seasonal.values[-self.seas-4]+res.seasonal.values[-self.seas-5]+res.seasonal.values[-self.seas-6])*15)
        # Output data frame
        out = pd.DataFrame({"id":self.col,
                            "{}_max".format(self.name):max_list,
                            "{}_mean".format(self.name):mean_list,
                            "{}_lag1".format(self.name):lag1_list,
                            "{}_lag2".format(self.name):lag2_list,
                            "{}_lag3".format(self.name):lag3_list,
                            "{}_lag4_6".format(self.name):lag4_6_list,
                            "{}_tre1".format(self.name):tre1_list,
                            "{}_tre3".format(self.name):tre3_list,
                            "{}_tre4_6".format(self.name):tre4_6_list,
                            "{}_seas1".format(self.name):seas1_list,
                            "{}_seas2".format(self.name):seas2_list,
                            "{}_seas3".format(self.name):seas3_list,
                            "{}_seas4_6".format(self.name):seas4_6_list
                           })
        return out
    
    def sep_lag_trend_seaso_test(self):
        max_list = []
        mean_list = []
        lag1_list = []
        lag2_list = []
        lag3_list = []
        lag4_6_list = []
        tre1_list = []
        tre3_list = []
        tre4_6_list = []
        seas1_list = []
        seas2_list = []
        seas3_list = []
        seas4_6_list = []   
        
        for i in self.col:
            # Calculate trend and seasonal factor
            res = sm.tsa.seasonal_decompose(self.data[i], period=self.seas)
            last = self.data[i].values[-1]
            max_ = self.data[i][:-1].values.max()
            mean = self.data[i].values[-19:-1].mean()
            # Append to list
            max_list.append(self.data[i].values[:-1].max())
            mean_list.append(self.data[i].values[:-1].mean())
            lag1_list.append((last - self.data[i].values[-2])*1)
            lag2_list.append((last - self.data[i].values[-3])*2)
            lag3_list.append((last - self.data[i].values[-4])*3)
            lag4_6_list.append((last - (self.data[i].values[-5]+self.data[i].values[-6]+self.data[i].values[-7]))*15)
            tre1_list.append((last - res.trend.values[-int(self.seas*0.5)-1])*1)
            tre3_list.append((last - res.trend.values[-int(self.seas*0.5)-3])*3)
            tre4_6_list.append((last - (res.trend.values[-int(self.seas*0.5)-4]+res.trend.values[-int(self.seas*0.5)-5]+res.trend.values[-int(self.seas*0.5)-6]))*15)
            seas1_list.append((res.seasonal.values[-self.seas-1])*1)
            seas2_list.append((res.seasonal.values[-self.seas-2])*2)
            seas3_list.append((res.seasonal.values[-self.seas-3])*3)
            seas4_6_list.append((res.seasonal.values[-self.seas-4]+res.seasonal.values[-self.seas-5]+res.seasonal.values[-self.seas-6])*15)
        # Output data frame
        out = pd.DataFrame({"id":self.col,
                            "{}_max".format(self.name):max_list,
                            "{}_mean".format(self.name):mean_list,
                            "{}_lag1".format(self.name):lag1_list,
                            "{}_lag2".format(self.name):lag2_list,
                            "{}_lag3".format(self.name):lag3_list,
                            "{}_lag4_6".format(self.name):lag4_6_list,
                            "{}_tre1".format(self.name):tre1_list,
                            "{}_tre3".format(self.name):tre3_list,
                            "{}_tre4_6".format(self.name):tre4_6_list,
                            "{}_seas1".format(self.name):seas1_list,
                            "{}_seas2".format(self.name):seas2_list,
                            "{}_seas3".format(self.name):seas3_list,
                            "{}_seas4_6".format(self.name):seas4_6_list
                           })
        return out

### shop lag and trend and seasonal

In [None]:
# Each shops count feature
shop_ts = pd.pivot_table(data=master, index=["year","month"], columns="shop_id", values="item_cnt_day", aggfunc="sum", fill_value=0)

date_ser = shop_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = shop_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "shop_count"

In [None]:
# Apply class for train data
shop_time_count = feature_eng(date_ser, data_df, seasonal_len, name)
shop_time_count_train = shop_time_count.sep_lag_trend_seaso_train()
# Apply class for test data
shop_time_count_test = shop_time_count.sep_lag_trend_seaso_test()

In [None]:
shop_time_count_train.head()

In [None]:
# Each shops price feature
shop_ts = pd.pivot_table(data=master, index=["year","month"], columns="shop_id", values="item_price", aggfunc="mean").fillna(method="ffill").fillna(0)

date_ser = shop_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = shop_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "shop_price"

In [None]:
# Apply class for train data
shop_time_price = feature_eng(date_ser, data_df, seasonal_len, name)
shop_time_price_train = shop_time_price.sep_lag_trend_seaso_train()
# Apply class for test data
shop_time_price_test = shop_time_price.sep_lag_trend_seaso_test()

In [None]:
shop_time_price_train.head()

### category lag and trend and seasonal

In [None]:
# Each Category count feature
cate_ts = pd.pivot_table(data=master, index=["year","month"], columns="item_category_id", values="item_cnt_day", aggfunc="sum", fill_value=0)

date_ser = cate_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = cate_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "category_count"

In [None]:
# Apply class for train data
cate_time_count = feature_eng(date_ser, data_df, seasonal_len, name)
cate_time_count_train = cate_time_count.sep_lag_trend_seaso_train()
# Apply class for test data
cate_time_count_test = cate_time_count.sep_lag_trend_seaso_test()

In [None]:
cate_time_count_train.head()

In [None]:
# Each Category price feature
cate_ts = pd.pivot_table(data=master, index=["year","month"], columns="item_category_id", values="item_price", aggfunc="mean").ffill().fillna(0)

date_ser = cate_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = cate_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "category_price"

In [None]:
# Apply class for train data
cate_time_price = feature_eng(date_ser, data_df, seasonal_len, name)
cate_time_price_train = cate_time_price.sep_lag_trend_seaso_train()
# Apply class for test data
cate_time_price_test = cate_time_price.sep_lag_trend_seaso_test()

In [None]:
cate_time_price_train.head()

### item lag and trend and seasonal

In [None]:
# Each Item count feature
item_ts = pd.pivot_table(data=master, index=["year","month"], columns="item_id", values="item_cnt_day", aggfunc="sum", fill_value=0)

date_ser = item_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = item_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "item_count"

In [None]:
# Apply class for train data
item_time_count = feature_eng(date_ser, data_df, seasonal_len, name)
item_time_count_train = item_time_count.sep_lag_trend_seaso_train()
# Apply class for test data
item_time_count_test = item_time_count.sep_lag_trend_seaso_test()

In [None]:
item_time_count_train.head()

In [None]:
# Each Item Price feature
item_ts = pd.pivot_table(data=master, index=["year","month"], columns="item_id", values="item_price", aggfunc="mean").ffill().fillna(0)

date_ser = item_ts.reset_index().drop(["year", "month"], axis=1).index
data_df = item_ts.reset_index().drop(["year", "month"], axis=1)
seasonal_len = 12
name = "item_price"

In [None]:
# Apply class for train data
item_time_price = feature_eng(date_ser, data_df, seasonal_len, name)
item_time_price_train = item_time_price.sep_lag_trend_seaso_train()
# Apply class for test data
item_time_price_test = item_time_price.sep_lag_trend_seaso_test()

In [None]:
item_time_price_train.head()

In [None]:
del shop_time_count, shop_time_price, cate_time_count, cate_time_price, item_time_count, item_time_price

## Data preparation for training

In [None]:
# Downcast 64-bit numeric columns to 32-bit to save memory
def dtype_change(df):
    for col, dtype in df.dtypes.items():
        if dtype == "int64":
            df[col] = df[col].astype("int32")
        elif dtype == "float64":
            df[col] = df[col].astype("float32")
    return df

In [None]:
# Training data
shop_time_count_train = dtype_change(shop_time_count_train)
shop_time_price_train = dtype_change(shop_time_price_train)
cate_time_count_train = dtype_change(cate_time_count_train)
cate_time_price_train = dtype_change(cate_time_price_train)
item_time_count_train = dtype_change(item_time_count_train)
item_time_price_train = dtype_change(item_time_price_train)
# Test data
shop_time_count_test = dtype_change(shop_time_count_test)
shop_time_price_test = dtype_change(shop_time_price_test)
cate_time_count_test = dtype_change(cate_time_count_test)
cate_time_price_test = dtype_change(cate_time_price_test)
item_time_count_test = dtype_change(item_time_count_test)
item_time_price_test = dtype_change(item_time_price_test)

## Data merging

In [None]:
# Assign test data ID to training data
master = pd.merge(master, df_test, left_on=["shop_id", "item_id"], right_on=["shop_id", "item_id"], how="left")

In [None]:
# Group by shop_id and item_id, and group them in data blocks in the column direction.
pivot = pd.pivot_table(data=master, index=["shop_id", "item_id"], columns="date_block_num", values="item_cnt_day", aggfunc="sum")

In [None]:
# Extract the last block (33: training target and test base point), the block before it (32: training base point), and the block 13 months before the last (20).
last_test_block = pivot.iloc[:,-1].reset_index()
last_train_block = pivot.iloc[:,-2].reset_index()
last_train_2ndblock = pivot.iloc[:,-14].reset_index()
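
The negative `iloc` positions used above can be checked on a toy frame. The sketch below assumes the pivot has one column per `date_block_num` from 0 to 33 (34 blocks); `pivot_demo` and its values are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot: 2 shop/item pairs x 34 monthly blocks (0..33).
blocks = list(range(34))
pivot_demo = pd.DataFrame(np.arange(68).reshape(2, 34), columns=blocks)

# Negative positions count from the last column.
last = pivot_demo.columns[-1]          # final block
previous = pivot_demo.columns[-2]      # one block earlier
year_before = pivot_demo.columns[-14]  # 13 blocks before the final one
print(last, previous, year_before)
```

Under that assumption the three positions map to blocks 33, 32 and 20, which are the column labels dropped or kept later when `X_Train` and `X_Test` are built.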

In [None]:
# Combine test data frame
Base = pd.merge(df_test, last_train_2ndblock, left_on=["shop_id", "item_id"], right_on=["shop_id", "item_id"], how="left")
Base = pd.merge(Base, last_train_block, left_on=["shop_id", "item_id"], right_on=["shop_id", "item_id"], how="left")
Base = pd.merge(Base, last_test_block, left_on=["shop_id", "item_id"], right_on=["shop_id", "item_id"], how="left")

Base = dtype_change(Base)
del last_test_block, last_train_block, last_train_2ndblock

In [None]:
# Create data with a corresponding relationship between item_id and category_id
category = train_df[["item_id", "item_category_id"]].drop_duplicates()

In [None]:
# merge Base and category
Base = pd.merge(Base, category, left_on="item_id", right_on="item_id", how="left")

del category

In [None]:
# Data check
# Null data
Base.isnull().sum()

In [None]:
# Data shape
Base.shape

The aggregated data shows that most of the values are null: the ratio is 90% or more.
These null values need to be handled. The feature variables can be filled in from the shop_id, item_id, and category information. The target value and the base-point value, on the other hand, are missing because the shop/item combination does not appear in the sales history. Such a combination is assumed to have had no sales in the first place, so these values are filled with 0.
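
The effect can be reproduced on a toy example. In the sketch below, the frames `test_pairs` and `sales` are hypothetical stand-ins rather than the competition data: a left merge leaves NaN for pairs with no sales history, and `fillna(0)` treats them as zero sales.

```python
import pandas as pd

# Hypothetical test pairs; only the first one has any sales history.
test_pairs = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
sales = pd.DataFrame({"shop_id": [1], "item_id": [10], "item_cnt": [3.0]})

# A left merge keeps every test pair; pairs without history get NaN.
merged = pd.merge(test_pairs, sales, on=["shop_id", "item_id"], how="left")
null_ratio = merged["item_cnt"].isnull().mean()

# Treat "never sold" as zero sales.
filled = merged.fillna(0)
print(null_ratio)
print(filled)
```

Filling with 0 is a modeling assumption: it cannot distinguish an item that was never stocked from one that was stocked but did not sell.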

In [None]:
# Training data
# shop data
Train = pd.merge(Base, shop_time_count_train, left_on="shop_id", right_on="id", how="left")
Train = pd.merge(Train, shop_time_price_train, left_on="shop_id", right_on="id", how="left")

# category data
Train = pd.merge(Train, cate_time_count_train, left_on="item_category_id", right_on="id", how="left")
Train = pd.merge(Train, cate_time_price_train, left_on="item_category_id", right_on="id", how="left")

# item data
Train = pd.merge(Train, item_time_count_train, left_on="item_id", right_on="id", how="left")
Train = pd.merge(Train, item_time_price_train, left_on="item_id", right_on="id", how="left")

Train.fillna(0, inplace=True)

Train.head()

In [None]:
# Divide learning data into explanatory variables and target values
# Train data
X_Train = Train.drop(["ID", "item_id", 33, 20, "id_x", "id_y", "shop_id", "item_category_id"], axis=1)

y_Train = Train[33].clip(0,20)

In [None]:
# Test data
# shop data
Test = pd.merge(Base, shop_time_count_test, left_on="shop_id", right_on="id", how="left")
Test = pd.merge(Test, shop_time_price_test, left_on="shop_id", right_on="id", how="left")
# category data
Test = pd.merge(Test, cate_time_count_test, left_on="item_category_id", right_on="id", how="left")
Test = pd.merge(Test, cate_time_price_test, left_on="item_category_id", right_on="id", how="left")
# item data
Test = pd.merge(Test, item_time_count_test, left_on="item_id", right_on="id", how="left")
Test = pd.merge(Test, item_time_price_test, left_on="item_id", right_on="id", how="left")

Test.fillna(0, inplace=True)

In [None]:
Test.head()

In [None]:
# Extract the explanatory variables from the test data
X_Test = Test.drop(["ID", "item_id", 32, 20, "id_x", "id_y", "shop_id", "item_category_id"], axis=1)

In [None]:
print("X_Train shape:{}".format(X_Train.shape))
print("y_Train shape:{}".format(y_Train.shape))
print("X_Test shape:{}".format(X_Test.shape))

## Training and validation

The training data is divided into model training and evaluation data.

In [None]:
# Train test data split
X_train, X_val, y_train, y_val = train_test_split(X_Train, y_Train, test_size=0.2, random_state=10)

Two models are used this time: LGBM for nonlinear prediction and Ridge for linear prediction.<br>
Each model's results are checked with residual plots and prediction plots before deciding how to combine them into the final predicted values.
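
One simple way to combine two regressors is a weighted average of their predictions. The sketch below is illustrative only: `blend`, its `weight_a` parameter, and the prediction arrays are hypothetical and do not come from this notebook, which compares the two models before choosing the final output.

```python
import numpy as np

def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction arrays (hypothetical helper)."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

# Made-up predictions standing in for the LGBM and Ridge outputs.
pred_tree = np.array([0.0, 2.0, 25.0])
pred_linear = np.array([1.0, 4.0, 15.0])

# Equal-weight blend, then clip to the [0, 20] range used for the target.
blended = blend(pred_tree, pred_linear).clip(0, 20)
print(blended)
```

In practice the weight would be chosen on the validation split, for example by comparing the clipped MSE for a few candidate values.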

### Light GBM

In [None]:
# Create instance
lgbm = lgb.LGBMRegressor()

params = {'learning_rate': [0.14, 0.18, 0.20], 'max_depth': [8, 10, 12]}

# Fitting
cv_lg = GridSearchCV(lgbm, params, cv = 10, n_jobs =1)
cv_lg.fit(X_train, y_train)

print("Best params:{}".format(cv_lg.best_params_))

best_lg = cv_lg.best_estimator_

# Prediction with the best estimator found by the grid search
y_train_pred_lg = best_lg.predict(X_train)
y_val_pred_lg = best_lg.predict(X_val)

print("MSE train:{}".format(mean_squared_error(y_train, y_train_pred_lg)))
print("MSE val:{}".format(mean_squared_error(y_val, y_val_pred_lg)))

print("R2 score train:{}".format(r2_score(y_train, y_train_pred_lg)))
print("R2 score val:{}".format(r2_score(y_val, y_val_pred_lg)))

### Ridge

In [None]:
# Training and score
ridge = Ridge()
params = {'alpha': [10000, 3000, 2000, 1000, 100, 10, 1]}

# Fitting
cv_r = GridSearchCV(ridge, params, cv = 10, n_jobs =1)
cv_r.fit(X_train, y_train)

print("Best params:{}".format(cv_r.best_params_))

best_r = cv_r.best_estimator_

# prediction
y_train_pred_r = best_r.predict(X_train)
y_val_pred_r = best_r.predict(X_val)

print("MSE train:{}".format(mean_squared_error(y_train, y_train_pred_r)))
print("MSE val:{}".format(mean_squared_error(y_val, y_val_pred_r)))

print("R2 score train:{}".format(r2_score(y_train, y_train_pred_r)))
print("R2 score val:{}".format(r2_score(y_val, y_val_pred_r)))

## Validation data check

In [None]:
plt.figure(figsize=(6,6))
plt.scatter(y_val_pred_lg, y_val_pred_lg - y_val, c="red", marker='o', alpha=0.5, label="LGBM")
plt.scatter(y_val_pred_r, y_val_pred_r - y_val, c="green", marker='o', alpha=0.5, label="Ridge")
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc = 'upper left')

The results show that the target is quite difficult to predict. The residuals (predicted minus actual values) are not uniformly scattered around zero and are biased toward the positive side.
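
A bias like this can also be quantified rather than only read off the plot. The sketch below uses made-up numbers, not the notebook's validation results, and follows the same sign convention as the plot (prediction minus actual).

```python
import numpy as np

# Made-up validation targets and predictions (not the notebook's results).
y_true = np.array([0.0, 1.0, 2.0, 0.0])
y_pred = np.array([0.5, 1.5, 2.0, 1.0])

# Same sign convention as the plot: residual = prediction - actual.
residuals = y_pred - y_true
mean_residual = residuals.mean()
share_positive = (residuals > 0).mean()
print(mean_residual, share_positive)
```

A mean residual well above zero, or a share of positive residuals far from one half, would indicate that the model tends to overpredict.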

In [None]:
plt.figure(figsize=(6,6))
plt.scatter(y_val.clip(0,20), y_val_pred_lg.clip(0,20), c="red", marker='o', alpha=0.5, label="LGBM")
plt.scatter(y_val.clip(0,20), y_val_pred_r.clip(0,20), c="green", marker='o', alpha=0.2, label="Ridge")
plt.xlabel('y_val data')
plt.ylabel('y_prediction')
plt.xlim([-2,22])
plt.ylim([-2,22])
plt.legend(loc = 'upper left')

print("MSE val LGBM:{}".format(mean_squared_error(y_val.clip(0,20), y_val_pred_lg.clip(0,20))))
print("MSE val Ridge:{}".format(mean_squared_error(y_val.clip(0,20), y_val_pred_r.clip(0,20))))

With clipping, LGBM has the lower prediction MSE. Without clipping the order was the opposite, but because the submission rules clip the target to the range 0 to 20, LGBM is the better model for this task.
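
Clipping can change the ranking because errors outside the scoring range stop counting. The sketch below illustrates this with made-up values; `clipped_mse` is a hypothetical helper, not part of scikit-learn or of this notebook.

```python
import numpy as np

def clipped_mse(y_true, y_pred, low=0, high=20):
    """MSE after clipping both arrays to [low, high] (hypothetical helper)."""
    y_true = np.clip(np.asarray(y_true, dtype=float), low, high)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), low, high)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up values: two predictions fall outside the [0, 20] range.
y_true = np.array([0.0, 5.0, 20.0])
y_pred = np.array([-2.0, 5.0, 30.0])

raw_mse = float(np.mean((y_true - y_pred) ** 2))
print(raw_mse, clipped_mse(y_true, y_pred))
```

Here the out-of-range predictions dominate the raw error but contribute nothing once both arrays are clipped, which is why a model can look worse unclipped and better clipped.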

The final predictions are made with the tuned LGBM model and clipped to the range 0 to 20.

In [None]:
## Test prediction
y_test_pred = best_lg.predict(X_Test).clip(0,20)

In [None]:
# Prediction visualization
plt.figure(figsize=(10,6))
sns.histplot(y_test_pred, kde=False, bins=20)
plt.xlabel("prediction")
plt.xlim([-0.5,20.5])
plt.xticks(range(21))
plt.ylabel("Frequency")
plt.yscale("log")

In [None]:
# submit dataframe
submit = sample.copy()
submit["item_cnt_month"] = y_test_pred

submit.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")