To improve the model's estimation accuracy, it is important to understand the characteristics of the data. In this notebook, we analyze the relationship between each variable and the target, sales volume. <br>

Next steps (under study) <br>
Based on this analysis, we will consider data preprocessing, select and create important features, and build the model.

In [None]:
# Basic libraries
import pandas as pd
import numpy as np
import time
import datetime
import gc

# Data preprocessing
import category_encoders as ce

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")

# Time series analysis
from statsmodels.graphics.tsaplots import plot_acf
import statsmodels.api as sm

# Normality test
from scipy import stats

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data loading

In [None]:
# Data loading
path = "/kaggle/input/m5-forecasting-accuracy/"

calendar = pd.read_csv(os.path.join(path,"calendar.csv"))
train = pd.read_csv(os.path.join(path, "sales_train_validation.csv"))
price = pd.read_csv(os.path.join(path, "sell_prices.csv"))

In [None]:
# Downcast dtypes to reduce memory usage
# Calendar data
def dtype_ch_calendar(df):
    # Columns name
    int16_col = ["wday", "month", "snap_CA", "snap_TX", "snap_WI","wm_yr_wk", "year"]

    # dtype change
    df["date"] = pd.to_datetime(df["date"])
    df[int16_col] = df[int16_col].astype("int16")

    return df

# price data
def dtype_ch_price(df):
    # Columns name
    int16_col = ["wm_yr_wk"]
    float16_col = ["sell_price"]

    # dtype change
    df[int16_col] = df[int16_col].astype("int16")
    df[float16_col] = df[float16_col].astype("float16")

    return df

# train data
def dtype_ch_train(df):
    # Columns name
    int16_col = df.loc[:,"d_1":].columns
    # dtype change
    df[int16_col] = df[int16_col].astype("int16")

    return df
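
One caveat on the `float16` cast for `sell_price`: half precision keeps only about three significant decimal digits, so prices get rounded. The sketch below uses made-up prices (not values from `sell_prices.csv`) to compare the rounding error and memory footprint of `float16` and `float32`.

```python
import pandas as pd

# Made-up prices for illustration only (not taken from sell_prices.csv)
prices = pd.Series([0.98, 2.48, 9.98, 107.32], name="sell_price")

as_f16 = prices.astype("float16")
as_f32 = prices.astype("float32")

# Largest absolute rounding error introduced by each cast
err16 = (as_f16.astype("float64") - prices).abs().max()
err32 = (as_f32.astype("float64") - prices).abs().max()

# Bytes used by the values themselves
bytes16 = as_f16.to_numpy().nbytes
bytes32 = as_f32.to_numpy().nbytes

print("float16: max error {:.6f}, {} bytes".format(err16, bytes16))
print("float32: max error {:.6f}, {} bytes".format(err32, bytes32))
```

If the rounding matters for price-based features, `float32` may be a safer trade-off; that is a judgment call and not something this notebook tested.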

In [None]:
def create_features_calendar(df):
    # Change dtype to light
    df = dtype_ch_calendar(df)

    # day of month variable
    df["mday"] = df["date"].dt.day.astype("int16")

    # event object to numerical, ordinal encoder
    list_col = ["event_name_1", "event_type_1", "event_name_2", "event_type_2"]
    for i in list_col:
        ce_oe = ce.OrdinalEncoder(cols=i, handle_unknown='impute') # Create instance of OrdinalEncoder
        df = ce_oe.fit_transform(df)
        df[i] = df[i].astype("int16") # change to light dtype
        
    return df
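
For readers without `category_encoders`, ordinal encoding can be approximated with plain pandas. This is a swapped-in technique, not what the notebook uses: `pd.factorize` numbers the labels in order of first appearance and marks missing values with -1. The event names below are placeholders.

```python
import pandas as pd

# Toy calendar column; the event names are placeholders
events = pd.Series(["SuperBowl", None, "Easter", "SuperBowl"], name="event_name_1")

# factorize assigns codes in order of first appearance; missing values get -1
codes, uniques = pd.factorize(events)
encoded = pd.Series(codes, index=events.index, name="event_name_1").astype("int16")

print(encoded.tolist())   # integer code per row
print(list(uniques))      # position in this list is the code of each label
```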

In [None]:
# Downcast dtypes (create_features_calendar already calls dtype_ch_calendar)
calendar = create_features_calendar(calendar)
price = dtype_ch_price(price)
train = dtype_ch_train(train)

In [None]:
# Data merge
def data_merge_3df(train, calendar, price):
    id_col = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]
    # Wide to long: one row per (item, store, day)
    df = train.melt(id_vars=id_col, var_name="d", value_name="volume")
    df.drop(["id", "cat_id", "state_id"], axis=1, inplace=True)

    # calendar data merge
    df = pd.merge(df, calendar, on="d", how="left")
    # price data merge
    df = pd.merge(df, price, on=["store_id", "item_id", "wm_yr_wk"], how="left")
    
    df.drop("wm_yr_wk", axis=1, inplace=True)

    gc.collect()

    return df
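
To make the reshaping in `data_merge_3df` concrete, here is a minimal sketch on invented toy tables (the ids, week number, and price are assumptions for illustration). It shows the wide-to-long `melt` followed by the two left joins, and that rows without a listed price end up with a missing `sell_price`.

```python
import pandas as pd

# Toy stand-ins for the three M5 tables (all values are invented)
train_toy = pd.DataFrame({
    "item_id": ["A", "B"],
    "store_id": ["S1", "S1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
})
calendar_toy = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "wm_yr_wk": [11101, 11101],
})
price_toy = pd.DataFrame({
    "store_id": ["S1"],
    "item_id": ["A"],
    "wm_yr_wk": [11101],
    "sell_price": [1.5],
})

# Wide to long: one row per (item, store, day)
long_df = train_toy.melt(id_vars=["item_id", "store_id"], var_name="d", value_name="volume")

# Left joins keep every sales row, even when no price is listed
long_df = long_df.merge(calendar_toy, on="d", how="left")
long_df = long_df.merge(price_toy, on=["store_id", "item_id", "wm_yr_wk"], how="left")

print(long_df)
print("rows without a price:", long_df["sell_price"].isna().sum())
```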

In [None]:
gc.collect()

In [None]:
# Create merged dataframe
master = data_merge_3df(train, calendar, price)

In [None]:
del calendar
gc.collect()

# Features and EDA

My model training plan <br>
*Model building is not included in this kernel.

### Target data : volume value <br>
### Train data : only information available at least 28 days before the target date <br>

Analysis of relationships with the target data <br>

1) category features : "item_id", "dept_id", "cat_id", "store_id", "state_id" <br>

- item_id : N=3,049, volume(item_id average) distribution on latest day.

- dept_id : N=7 separated from cat_id, volume distribution on latest day with boxplot<br>

- cat_id : N=3 ↑<br>

- store_id : N=10 separated from state_id, volume distribution on latest day with boxplot<br>

- state_id : N=3 ↑<br>

- dept_id and store_id and volume(average and max) of latest day with bubble plot

2) price features : "sell_price"
- sell_price : price distribution with boxplot on latest day & scatter plot vs volume
 
3-1) day features : "year", "month", "day of month", "weekday" <br>

- year : 2011 ~ 2016 vs volume(Average) with boxplot<br>

- month : 1 ~ 12 vs volume(Average) with boxplot <br>

- day of month : 1 ~ 31 vs volume(Average) with boxplot<br>

- weekday : 1 ~ 7 (Saturday ~ Friday ) vs volume(Average) with boxplot <br>

3-2) Time series

- Target time series analysis, autocorrelation plot

- Residuals of the volume time series, checked with a normality test

# Create EDA Functions

In [None]:
# Distribution plot function
def distribution_plot(df, col_name="item_id", target_value="d_1913"):
    value = df.groupby(col_name)[target_value].mean().values

    # Visualization
    plt.figure(figsize=(10,6))
    sns.distplot(value)
    plt.xlabel("Volume")
    plt.ylabel("Frequency")
    plt.title("Volume distribution by each {} at {}".format(col_name, target_value))
    
# Box plot function
def box_plot(x, y, df, size=(20,6), y_label="Volume", stripplot=True):
    fig, ax = plt.subplots(1, 2, figsize=size)

    # Including outliers
    if stripplot == True:
        sns.boxplot(x=x, y=y, data=df, showfliers=False, ax=ax[0])
        sns.stripplot(x=x, y=y, data=df, jitter=True, ax=ax[0])
    else:
        sns.boxplot(x=x, y=y, data=df, ax=ax[0])
    ax[0].set_ylabel(y_label)
    ax[0].set_title("box plot at {} with outliers".format(y))
    ax[0].tick_params(axis='x', labelrotation=45)
    
    # Not including outliers
    sns.boxplot(x=x, y=y, data=df, ax=ax[1], showfliers=False)
    ax[1].set_ylabel(y_label)
    ax[1].set_title("box plot at {} without outliers".format(y))
    ax[1].tick_params(axis='x', labelrotation=45)

# 2 params bubble plot function
def bubble_plot(x="store_id", y="dept_id", s="d_1913", df=train, size=(20,10)):
    # mean value
    data_ave = df.groupby([x,y])[s].mean().reset_index()
    x_ave = data_ave[x]
    y_ave = data_ave[y]
    s_ave = data_ave[s]

    # max values
    data_max = df.groupby([x,y])[s].max().reset_index()
    x_max = data_max[x]
    y_max = data_max[y]
    s_max = data_max[s]

    # visualization
    fig, ax = plt.subplots(1, 2, figsize=size)

    ax[0].scatter(x_ave, y_ave, s=s_ave*100, alpha=0.5, color="blue")
    ax[0].set_xlabel(x)
    ax[0].set_ylabel(y)
    ax[0].set_title("Bubble chart of average volume on {}".format(s))

    ax[1].scatter(x_max, y_max, s=s_max*100, alpha=0.5, color="green")
    ax[1].set_xlabel(x)
    ax[1].set_ylabel(y)
    ax[1].set_title("Bubble chart of maximum volume on {}".format(s))

# correlation plot
def correlation_plot(x="sell_price", y="volume", df=master, sample_size=3000, size=(8,8)):
    samp = df.sample(sample_size)
    x_data = samp[x]
    y_data = samp[y]
    
    plt.figure(figsize=size)
    plt.scatter(x_data,y_data)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.title("{} vs {}, sampling {}".format(x, y, sample_size))

In [None]:
# Time series
def time_series_plot(data, freq=28, size=(20,12), title=""):
    index = data.index
    col = data.columns
    
    fig, ax = plt.subplots(2,1, figsize=size)
    
    # Raw data
    for i in col:
        ax[0].plot(index, data[i], label=i, linewidth=1)
        ax[0].legend()
        ax[0].set_title("{} : Time series plot of raw data".format(title))
    
    # Rolling mean data
    for i in col:
        ax[1].plot(index, data[i].rolling(freq).mean(), label=i, linewidth=1)
        ax[1].legend()
        ax[1].set_title("{} : Time series plot of rolling {} data".format(title, freq))

# R coefficient plot 
def r_coef_plot(df, max_lag=72, size=(20,6)):
    col = df.loc[:, "d_1":].columns
    
    r_corr = []
    lag = range(len(col))
    
    for i in range(len(col)):
        x = df.iloc[:,-1]
        y = df[col[-i-1]]
        r = np.corrcoef(x,y)[0,1]
        r_corr.append(r)
        
    fig, ax = plt.subplots(1,2, figsize=size)
    
    ax[0].plot(lag, r_corr)
    ax[0].set_xlabel("lag")
    ax[0].set_ylabel("R coefficient")
    ax[0].set_ylim([0,1])
    ax[0].set_title("Lag volume R coefficient")
    
    ax[1].plot(lag[:max_lag], r_corr[:max_lag])
    ax[1].set_xlabel("lag")
    ax[1].set_ylabel("R coefficient")
    ax[1].set_ylim([0,1])
    ax[1].set_title("Lag volume (Max lag range {}) R coefficient".format(max_lag))

# Auto correlation plot
def autocorrelation_plot(data, lags=28):
    col = data.columns
    fig, ax = plt.subplots(len(col), 2, figsize=(20, 6*len(col)))
    
    for c in range(len(col)):                 
        # autocorrelation
        plot_acf(data[col[c]], lags=lags, ax=ax[c,0])
        ax[c,0].set_title("Auto correlation of {}".format(col[c]))
        ax[c,0].set_xlabel("lag")
        ax[c,0].set_ylabel("auto correlation")
        # time series
        ax[c,1].plot(data[col[c]][-365:].index, data[col[c]][-365:], linewidth=1)
        ax[c,1].set_title("Time series data of {}".format(col[c]))
        ax[c,1].set_xlabel("day")
        ax[c,1].set_ylabel("volume")
        
    plt.show()

# Resid normality test
def normality_test(df, freq=7):
    col_name = df.columns
    resid_df = pd.DataFrame({})
    
    for c in col_name:
        res = sm.tsa.seasonal_decompose(df[c], period=freq)
        resid_df["Resid_{}".format(c)] = res.resid
        
    resid_df.dropna(inplace=True)
        
    fig, ax = plt.subplots(resid_df.shape[1], 2, figsize=(20, 6*resid_df.shape[1]))
    plt.subplots_adjust(hspace=0.4)
    col_name = resid_df.columns
    for i in range(len(resid_df.columns)):
        # Shapiro wilk test
        WS, p = stats.shapiro(resid_df[col_name[i]])
        # distribution plot
        sns.distplot(resid_df[col_name[i]], ax=ax[i, 0])
        ax[i, 0].set_xlabel("resid")
        ax[i, 0].set_title("Distribution of resid : {} \n p-value of Shapiro Wilk test : {:.3f}".format(col_name[i], p))
        # probability
        stats.probplot(resid_df[col_name[i]], plot=ax[i,1])
        ax[i, 1].set_title("Probability plot")

# Exploratory data analysis

## item_id 
N=3,049, volume(item_id average) distribution on latest day

In [None]:
distribution_plot(train, col_name="item_id", target_value="d_1913")

- Sales volume per item spans a wide range. Items with an average of 10 or more exist, but they are very rare. These are likely hard to predict because the sample is very small and the data are sparse. <br>
- First, we need to understand what kind of data sits in this tail of the distribution.
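
A quick way to size up this tail is to measure how many items fall at or above the threshold and how much of the total volume they carry. The numbers below are invented to mimic a skewed distribution; they are not results from the M5 data.

```python
import pandas as pd

# Invented per-item average volumes, skewed like the distribution described above
item_mean = pd.Series([0, 0, 0, 1, 1, 2, 2, 3, 12, 40], dtype="float64")

threshold = 10  # "10 or more", as in the text
in_tail = item_mean >= threshold

tail_share = in_tail.mean()                                    # fraction of items in the tail
tail_volume_share = item_mean[in_tail].sum() / item_mean.sum() # fraction of volume they carry

print("items in tail: {:.0%}".format(tail_share))
print("volume from tail items: {:.0%}".format(tail_volume_share))
```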

## dept_id
dept_id : N=7 separated from cat_id, volume distribution on latest day with boxplot

In [None]:
# dept_id : N=7 separated from cat_id, volume distribution on latest day with boxplot
box_plot(x="dept_id", y="d_1913", df=train, size=(20,6), y_label="Volume", stripplot=True)

This is a boxplot of sales volume on the final day by dept_id.
- First, most sales are between 0 and 2. On the other hand, many outliers are observed, spread widely from 10 up to a maximum of about 120.

- Quite a lot of products have 0 sales. Correctly predicting zero is therefore very important.

dept_id appears to be a necessary feature.
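
Since zero sales matter so much, it can help to quantify them per department. A minimal sketch on an invented frame (the column names follow `train`, the values are made up):

```python
import pandas as pd

# Invented sales on one day; column names mirror the train table
toy = pd.DataFrame({
    "dept_id": ["FOODS_3", "FOODS_3", "FOODS_3", "HOBBIES_1", "HOBBIES_1", "HOBBIES_1"],
    "d_1913": [0, 4, 1, 0, 0, 2],
})

# 1 where nothing was sold, 0 otherwise; the group mean is the share of zeros
is_zero = (toy["d_1913"] == 0).astype("int8")
zero_share = is_zero.groupby(toy["dept_id"]).mean()

print(zero_share)
```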

## store_id
store_id : N=10 separated from state_id, volume distribution on latest day with boxplot

In [None]:
box_plot(x="store_id", y="d_1913", df=train, size=(20,6), y_label="Volume", stripplot=True)

Sales volume tends to differ by store_id.
- CA_1,2,3 and WI_1,2 sell a lot. On the other hand, CA_4, TX_1,2,3 and WI_3 sell little.
- All stores have outliers, but they are particularly large in CA_3 and, conversely, small in CA_4.

store_id also appears to be a necessary feature.

## dept_id and store_id
dept_id and store_id and volume(average and max) of latest day with bubble plot

In [None]:
bubble_plot(x="store_id", y="dept_id", s="d_1913", df=train, size=(20,6))

This figure visualizes the average and maximum sales volume for each combination of store_id and dept_id.

- The averages are large for HOUSEHOLD_1 and FOODS_3 in CA_1,2,3. FOODS_3 has a large difference between the average and the maximum value.
- The difference between the average and the maximum value is not constant; it changes depending on the combination of dept_id and store_id.

I will create a feature that combines the two.
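
One simple way to build such a combined feature is string concatenation. The column name `store_dept` is hypothetical and the rows are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "TX_2"],
    "dept_id": ["FOODS_3", "HOUSEHOLD_1", "FOODS_3"],
    "volume": [5, 1, 2],
})

# Hypothetical column name; one category per (store, department) pair
toy["store_dept"] = (toy["store_id"] + "_" + toy["dept_id"]).astype("category")

print(toy["store_dept"].tolist())
print(toy.groupby("store_dept", observed=True)["volume"].mean())
```

Storing the result as a `category` keeps memory low on the full merged frame, and the column can then go through the same encoder as the other categorical features.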

In [None]:
# for keeping memory
gc.collect()

## price
price distribution with boxplot on latest day & scatter plot vs volume

In [None]:
# price distribution with boxplot on latest day
box_plot(x="store_id", y="sell_price", df=price, size=(20,6), y_label="Price", stripplot=False)

This figure is a box plot of price by store_id.

- The average price is similar across stores. However, some stores have outliers, and these are very large.

In [None]:
# create dept_id from item_id
price_copy = price.copy()
price_copy["dept_id"] = [s.rsplit("_",1)[0] for s in price_copy["item_id"]]

box_plot(x="dept_id", y="sell_price", df=price_copy, size=(20,6), y_label="Price", stripplot=False)

del price_copy
gc.collect()

- The price range differs significantly by dept_id. In particular, HOUSEHOLD_2 has a wide price range, and the large outliers also belong to this department.
- HOBBIES_1 is expensive on average.

In [None]:
del price
gc.collect()

In [None]:
# Correlation with volume
correlation_plot(x="sell_price", y="volume", df=master, sample_size=1500, size=(8,8))

This figure plots the relationship between price and sales volume.

- Products with a large sales volume tend to have a low price. Sales volume does not increase noticeably as the price goes up.
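
Because the relationship is monotone rather than linear, a rank correlation may summarize it better than the scatter plot alone. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, using pandas only; the five data points are invented.

```python
import pandas as pd

# Invented (price, volume) pairs with a decreasing trend
toy = pd.DataFrame({
    "sell_price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume": [10, 6, 6, 2, 1],
})

# Pearson correlation of the ranks equals Spearman's rank correlation
rho = toy["sell_price"].rank().corr(toy["volume"].rank())
print("rank correlation: {:.3f}".format(rho))
```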

## day features : "year", "month", "day of month", "weekday", "event", "snap"<br>

- year : 2011 ~ 2016 vs volume(Average) <br>

- month : 1 ~ 12 vs volume(Average) <br>

- day of month : 1 ~ 31 vs volume(Average)<br>

- weekday : 1 ~ 7 (Saturday ~ Friday ) vs volume(Average) <br>

- snap and event flag vs volume

## year

In [None]:
# year
box_plot(x="year", y="volume", df=master, size=(20,6), y_label="Volume", stripplot=False)

Box plots by year.

- Most sales volumes are 0 to 2, but there are large outliers in every year.

- The largest sales values become smaller as the years go by (note that 2016 covers only part of the year).

Since the level of sales volume may vary from year to year, the following time variables are analyzed on 2015 data only.

## month

In [None]:
# month
box_plot(x="month", y="volume", df=master[master["year"]==2015], size=(20,6), y_label="Volume", stripplot=False)

Looking at the overall average, no particular month stands out with higher sales.

## weekday

In [None]:
# weekday
box_plot(x="weekday", y="volume", df=master[master["year"]==2015], size=(20,6), y_label="Volume", stripplot=False)

Sales volumes are higher on Saturday and Sunday. However, the largest outliers are not concentrated on Saturdays and Sundays.

## day of month

In [None]:
master["day_of_month"] = pd.to_datetime(master["date"]).dt.day

# day of month
box_plot(x="day_of_month", y="volume", df=master[master["year"]==2015], size=(20,6), y_label="Volume", stripplot=False)

Looking at the day of the month, sales are not noticeably concentrated at the beginning or end of the month.

In [None]:
gc.collect()

## snap_flg

In [None]:
# Omitted because it exceeds the memory limit on Kaggle
# snap
# master["snap_flag"] = master["snap_CA"] + master["snap_TX"] + master["snap_WI"]

# box_plot(x="snap_flag", y="volume", df=master[master["year"]==2015], size=(20,6), y_label="Volume", stripplot=False)

Regarding snap_flag, there is no difference in sales volume between the flag values.
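
The commented-out cell above sums the three state columns, so a row for a CA store is also flagged when only TX or WI has a SNAP day. One possible refinement, which this notebook did not run, is to pick the SNAP column that matches each store's own state. The sketch assumes the state is the prefix of `store_id`; the rows are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "store_id": ["CA_1", "TX_2", "WI_3", "CA_4"],
    "snap_CA": [1, 1, 0, 0],
    "snap_TX": [0, 1, 0, 1],
    "snap_WI": [0, 0, 1, 1],
})

# The state is the prefix of store_id (e.g. "CA_1" -> "CA")
state = toy["store_id"].str.split("_").str[0]

# Pick the SNAP column that belongs to each row's own state
toy["snap_flag"] = np.select(
    [state == "CA", state == "TX", state == "WI"],
    [toy["snap_CA"], toy["snap_TX"], toy["snap_WI"]],
    default=0,
).astype("int8")

print(toy[["store_id", "snap_flag"]])
```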

## event_flg

In [None]:
# Omitted because it exceeds the memory limit on Kaggle
# event_name
# master[["event_name_1", "event_name_2"]] = master[["event_name_1", "event_name_2"]].fillna("no")
# master["event_flag"] = master["event_name_1"] + str("+") + master["event_name_2"]

In [None]:
# Omitted because it exceeds the memory limit on Kaggle
# box_plot(x="event_flag", y="volume", df=master[master["year"]==2015], size=(20,6), y_label="Volume", stripplot=False)

Sales increase at specific events. event_flag is likely to be an important feature for sales forecasting.
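
A possible way around the memory limit, offered here as an untested suggestion, is to build the flag on the small calendar table and join it to already aggregated daily sales instead of the full merged frame. The sketch assumes raw, unencoded event names (in this notebook they are ordinal-encoded earlier); the table contents are invented.

```python
import pandas as pd

# Toy calendar with raw (unencoded) event names; values are invented
cal = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "event_name_1": ["SuperBowl", None, "Easter"],
    "event_name_2": [None, None, "OrthodoxEaster"],
})
# Toy daily totals, already aggregated over items
daily = pd.DataFrame({"d": ["d_1", "d_2", "d_3"], "volume": [120, 100, 90]})

# Build the flag on the small calendar table instead of the large merged frame
names = cal[["event_name_1", "event_name_2"]].fillna("no")
cal["event_flag"] = names["event_name_1"] + "+" + names["event_name_2"]

by_event = daily.merge(cal[["d", "event_flag"]], on="d", how="left")
print(by_event.groupby("event_flag")["volume"].mean())
```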

In [None]:
gc.collect()

## Time series analysis

### Correlation coefficient between objective variable and lag sales

In [None]:
r_coef_plot(train, max_lag=72, size=(20,6))

First, I checked the correlation between sales on the latest date and sales on earlier dates.<br>
- The correlation coefficient decreases as the lag grows. This confirms that the data closest in time to the forecast date carry the most information.
- The correlation coefficient also dips periodically at certain lags. This is presumably caused by specific days of the year (see below).
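
The quantity plotted by `r_coef_plot` is the correlation, across items, between the latest day and each earlier day. A vectorized sketch of the same idea with `DataFrame.corrwith` follows; the wide table is synthetic (Poisson sales around a fixed per-item rate), so it illustrates the computation only, not the decay seen in the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic wide table: 200 items x 10 days, Poisson sales around a per-item rate
n_items, n_days = 200, 10
rate = rng.gamma(shape=1.0, scale=2.0, size=n_items)
wide = pd.DataFrame(
    rng.poisson(lam=rate[:, None], size=(n_items, n_days)),
    columns=["d_{}".format(i) for i in range(1, n_days + 1)],
)

# Correlation of every day column with the latest day, computed across items
latest = wide.iloc[:, -1]
r = wide.corrwith(latest)

# Reorder so that position k holds the correlation at lag k (lag 0 = latest day itself)
r_by_lag = r.iloc[::-1].reset_index(drop=True)
print(r_by_lag.round(3))
```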

## Sample item (item_id)

In [None]:
# Sampling items creation
sample_num = [1, 100, 1000, 10000]
sample_id = [train["item_id"].values[i] for i in sample_num]

sample_df = master[master["item_id"].isin(sample_id)]
sample_df = pd.pivot_table(sample_df, index="date", columns="item_id", values="volume", aggfunc="mean")

In [None]:
time_series_plot(sample_df, freq=7, size=(20,12), title="item_id")

In [None]:
# Sample id
autocorrelation_plot(data=sample_df, lags=56)

In [None]:
# Normality test of sample 
normality_test(sample_df, freq=7)

In [None]:
normality_test(sample_df, freq=28)

In [None]:
del sample_df
gc.collect()

Several samples were taken to visualize the time series.

- Looking at the results, overall sales fluctuate like noise, so we need to understand the underlying cycle.

- The tendency also differs from item to item. In some cases, sales drop sharply during a certain period.

- The rise and fall of sales mostly follows a 7-day cycle, so lag features based on a 7-day cycle look preferable.

- Checking the residuals of the seasonal decomposition (which is based on a moving average), the distribution is close to normal around 0, but the tails are heavy and the outliers deviate from normality. As the results so far suggest, lag features based on moving averages alone are not enough; other information such as price and specific dates must also be included.
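
The 7-day cycle can also be checked numerically with `Series.autocorr`. The series below is synthetic (20 identical weeks with a weekend peak), so the lag-7 value is exactly periodic by construction; real items will be noisier.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: 20 identical weeks with a weekend peak
week = [1, 1, 1, 1, 1, 3, 4]
sales = pd.Series(
    np.tile(week, 20),
    index=pd.date_range("2015-01-01", periods=140, freq="D"),
    dtype="float64",
)

for lag in (1, 3, 7, 14):
    print("lag {:>2}: autocorrelation {:.3f}".format(lag, sales.autocorr(lag=lag)))

# Lag feature aligned with the weekly cycle: the value from the same weekday one week earlier
lag7 = sales.shift(7)
```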

## dept_id

In [None]:
# dept_id
dept_df = pd.pivot_table(master, index="date", columns="dept_id", values="volume", aggfunc="mean")

In [None]:
time_series_plot(dept_df, freq=7, size=(20,12), title="dept_id")

In [None]:
time_series_plot(dept_df, freq=28, size=(20,12), title="dept_id")

In [None]:
# dept_id
autocorrelation_plot(data=dept_df, lags=56)

In [None]:
# dept_df
normality_test(dept_df, freq=7);

In [None]:
# dept_df
normality_test(dept_df, freq=28);

In [None]:
del dept_df
gc.collect()

The average value by dept_id is plotted.

- We can see large dips in sales across the years.

- Smoothing with a 7-day moving average removes the noise and makes the week-to-week pattern easier to see. A 28-day moving average appears to capture the overall sales fluctuation.

- Looking at the data for the last year, we can see a big drop just before 2016-01. This is presumably the Christmas holiday.

- Many of the residual distributions have a long left tail. This is presumably due to the drops on particular days. Even when a moving average is used, days with such large drops need careful handling when creating features.
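
One way to handle such drop days, sketched here under assumptions, is to flag days that fall far below a rolling median and mask them before computing moving averages. The 0.5 ratio and the 7-day window are arbitrary illustrative choices, and the series is invented.

```python
import pandas as pd

# Invented daily series: flat sales with one holiday closure
idx = pd.date_range("2015-12-15", periods=21, freq="D")
sales = pd.Series(10.0, index=idx)
sales.loc[pd.Timestamp("2015-12-25")] = 0.0

# Robust local baseline: centered 7-day rolling median
baseline = sales.rolling(window=7, center=True, min_periods=1).median()

# Flag days far below the baseline (0.5 is an arbitrary, illustrative ratio)
is_drop = sales < 0.5 * baseline

# Mask flagged days so they do not pull the moving average down
cleaned = sales.mask(is_drop)
rolling_mean = cleaned.rolling(window=7, min_periods=1).mean()

print(sales[is_drop])
```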

## store_id

In [None]:
# store_id
store_df = pd.pivot_table(master, index="date", columns="store_id", values="volume", aggfunc="mean")

In [None]:
time_series_plot(store_df, freq=7, size=(20,12), title="store_id")

In [None]:
time_series_plot(store_df, freq=28, size=(20,12), title="store_id")

In [None]:
# store_id
autocorrelation_plot(data=store_df, lags=56)

In [None]:
# store_df
normality_test(store_df, freq=7)

In [None]:
normality_test(store_df, freq=28)

In [None]:
del store_df
gc.collect()

- Looking at the time-series graphs for each store, in addition to the big Christmas drop seen above, we can confirm drops in specific stores. There may be store-specific days such as store holidays.

- The autocorrelation also shows a strong 7-day cycle. Only in WI_2 and WI_3 is the 28-day cycle noticeably strong.

- As with dept_id, many of the residual distributions have long left tails.

# Summary

- Most sales volumes to forecast are 0, 1 or 2. However, the range above that is very wide. The larger the value, the sparser the training data, so some extra measure is needed to improve estimation accuracy.

- Sales volume varies across categories. However, many of the differences are small, and 75% of the data are between 0 and 3. Some categories tend to take large values, so features that can separate them are needed.

- Price range and category are related. Price and sales volume are also related, but not consistently: cheaper products may sell more, but not necessarily.

- From the time series data, lag features that are close in time are preferable to yearly lags. The 7-day and 28-day cycles are strong.

- To capture the 7-day cycle, use the raw data. To capture the 28-day cycle, 7-day moving average data may be included. The overall trend is likely to be captured with 28-day moving average data.

- A big drop in sales on certain days, such as holidays, acts as strong noise. Strong-noise days such as Christmas are not in the prediction range, but care is needed when creating lag features and building the model.
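
The lag and moving-average features described above could be built as follows. This is a minimal sketch, not the model pipeline of this notebook: the column names `lag_28`, `roll7_lag28` and `roll28_lag28` are hypothetical, the data are invented, and every feature is shifted by the 28-day horizon so that only information available 28 days before the target date is used.

```python
import numpy as np
import pandas as pd

# Toy long-format frame: 2 items x 60 days of invented sales
days = pd.date_range("2016-01-01", periods=60, freq="D")
toy = pd.DataFrame({
    "item_id": np.repeat(["A", "B"], len(days)),
    "date": list(days) * 2,
    "volume": np.concatenate([np.arange(60), np.arange(60) * 2]).astype("float64"),
})
toy = toy.sort_values(["item_id", "date"]).reset_index(drop=True)

horizon = 28  # features may only use data at least 28 days old
grouped = toy.groupby("item_id")["volume"]

# Raw lag: sales exactly 28 days earlier, computed per item
toy["lag_28"] = grouped.shift(horizon)

# Moving averages of the lagged series (7-day for the weekly level, 28-day for the trend)
toy["roll7_lag28"] = grouped.transform(lambda s: s.shift(horizon).rolling(7).mean())
toy["roll28_lag28"] = grouped.transform(lambda s: s.shift(horizon).rolling(28).mean())

print(toy[toy["item_id"] == "A"].tail(3))
```

Grouping by `item_id` before shifting keeps one item's history from leaking into another's; on the real data the grouping key would presumably be the item and store pair.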

Next) <br>
Clean the data and create features to build a prediction model, then evaluate the results.

### I'm having a hard time building the model myself and the accuracy is not improving much, but I hope this notebook will be helpful to someone.