# Store Item Prediction - exploration

The goal is to predict sales for the next three months for each item and store.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

import matplotlib.pyplot as plt

In [None]:
## Get data
## Load data
data = pd.read_csv("/kaggle/input/demand-forecasting-kernels-only/train.csv")

## Convert datatypes
data['date'] = pd.to_datetime(data['date'])

## Data exploration

There are 913000 samples. The data range from January 2013 to December 2017 and cover 10 different stores and 50 different items. The sales range from 0 to 231 and are right-skewed, with a shape resembling a chi-square distribution (see the histogram below). The distribution is similar for all stores.

In [None]:
data.describe()

In [None]:
print("Sales range from {} to {}.".format(data['sales'].min(), data['sales'].max()))
plt.hist(data['sales'], color='navy')
plt.title("Sales for all stores")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
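
One way to check the statement that the distribution is similar for all stores is to compare summary statistics per store. The following is a minimal sketch, not part of the original analysis: the helper name `sales_summary_by_store` is hypothetical, and the small synthetic frame only stands in for `data` so the snippet runs on its own. For a right-skewed distribution the mean is typically larger than the median.

```python
import numpy as np
import pandas as pd

def sales_summary_by_store(df):
    """Return count, mean, median, std, min and max of sales for each store."""
    return df.groupby("store")["sales"].agg(
        ["count", "mean", "median", "std", "min", "max"]
    )

# Synthetic stand-in for the notebook's `data` frame (an assumption, not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "store": np.repeat([1, 2, 3], 1000),
    "sales": rng.poisson(lam=50, size=3000),
})

summary = sales_summary_by_store(demo)
print(summary)
```

With the real data, `sales_summary_by_store(data)` would give one row per store, which makes differences between stores easy to spot.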

In [None]:
data.dtypes

In [None]:
data.head()

In [None]:
data.isnull().sum()

In [None]:
print("Data from {} to {}".format(data['date'].min(), data['date'].max()))
print("Number of samples: {}".format(len(data)))
print("Stores: {}".format(data['store'].unique().tolist()))
print("Different items: {}".format(data['item'].unique()))
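
The null check above only shows that no cell is empty; it does not show that every store and item combination has a row for every day. A possible check is sketched below. The helper name `find_incomplete_series` is hypothetical, and the tiny synthetic frame (with one day removed on purpose) is only there to make the snippet self-contained.

```python
import pandas as pd

def find_incomplete_series(df):
    """Return (store, item) pairs whose number of distinct dates differs
    from the number of calendar days in the overall date range."""
    expected_days = (df["date"].max() - df["date"].min()).days + 1
    counts = df.groupby(["store", "item"])["date"].nunique()
    return counts[counts != expected_days]

# Synthetic example (assumption): two stores, one item, ten days each
dates = pd.date_range("2013-01-01", "2013-01-10", freq="D")
demo = pd.DataFrame({
    "date": list(dates) * 2,
    "store": [1] * 10 + [2] * 10,
    "item": [1] * 20,
    "sales": range(20),
})
demo = demo.drop(index=15)  # remove one day for store 2 to create a gap

missing = find_incomplete_series(demo)
print(missing)
```

For reference, 10 stores × 50 items × 1826 days (2013-01-01 through 2017-12-31, including the leap day in 2016) equals 913000, which matches the sample count stated earlier, assuming the data covers exactly that range.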

We want to predict sales for the next three months. So let's look at the sales of one item at one store over the given time range.

In [None]:
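from pandas.plotting import register_matplotlib_converters

# Register the pandas datetime converters so matplotlib can plot the 'date' column
register_matplotlib_converters()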

## Sales of one item from one store
plt.figure(figsize=(20,10))

for i in range(4):
    plt.subplot(2, 2, i+1)
    data_slice = data.loc[data['item'] == 3+i]
    data_slice = data_slice.loc[data_slice['store'] == 2+i]
    plt.plot(data_slice['date'], data_slice['sales'], color='navy')

plt.show()

For an item at a store, there is seasonality. The overall trend per year looks similar for the combinations shown, with different levels of noise. Note that no outliers have been filtered yet.

### Pattern identification
There seems to be a trend as well as seasonality in the sales on a year scale. Let's investigate these findings in more detail. We also look at patterns on a month and week scale.

In [None]:
pd.plotting.lag_plot(data['sales'])
plt.show()

In [None]:
plt.figure(figsize=[20,10])
# Autocorrelation of the daily sales of item 39 at store 3
pd.plotting.autocorrelation_plot(data.loc[(data['store'] == 3) & (data['item'] == 39), 'sales'])
plt.show()

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
# Daily sales of item 39 at store 3, indexed by date
test = data.loc[(data['store'] == 3) & (data['item'] == 39)].set_index('date')['sales']

decomposed = seasonal_decompose(test, model='additive')
x = decomposed.plot()

The lag plot above shows that sales at consecutive time steps (a lag of 1) are correlated. The second plot shows the autocorrelation for increasing lag: the oscillation reflects the seasonality, and the decreasing magnitude reflects the trend. The four panels of the third plot (observed, trend, seasonal and residual) show that there are patterns on a month scale and a year scale. Next we will investigate these scales in more detail.
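
To put numbers on what the autocorrelation plot shows, the correlation at a few specific lags can be computed with the pandas method `Series.autocorr`. This is a minimal sketch under stated assumptions: the helper name `autocorr_at_lags` is hypothetical, and the synthetic daily series with a built-in weekly cycle only stands in for a real store and item series.

```python
import numpy as np
import pandas as pd

def autocorr_at_lags(series, lags):
    """Return the Pearson autocorrelation of `series` at each lag in `lags`."""
    return pd.Series(
        {lag: series.autocorr(lag=lag) for lag in lags},
        name="autocorrelation",
    )

# Synthetic daily series with a weekly cycle plus noise (assumption, not real data)
rng = np.random.default_rng(0)
days = pd.date_range("2013-01-01", periods=730, freq="D")
weekly = 10 * np.sin(2 * np.pi * np.arange(730) / 7)
demo = pd.Series(50 + weekly + rng.normal(0, 1, 730), index=days)

acf = autocorr_at_lags(demo, [1, 7, 30])
print(acf)
```

For a series with a weekly pattern, the value at lag 7 is expected to be higher than at lag 1. Applied to the notebook's `test` series, lags such as 7 and 365 would be natural candidates to inspect.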

### Year scale

Looking at the year 2015 we can see seasonality.

In [None]:
data_year = data.loc[(data['date'] >= '2015-01-01') & (data['date'] < '2016-01-01')]
data_year.loc[data_year['store'] == 8].loc[data_year['item'] == 35]['sales'].plot()

In [None]:
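def moving_average(a, n=3):
    """Return the simple moving average of `a` over a window of `n` values."""
    # Cumulative sum; differences taken n positions apart give each window's sum
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    # Drop the first n-1 entries, which do not cover a full window
    return ret[n - 1:] / n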

plt.figure(figsize=(20,10))
weeks = [1, 4, 8, 12]
sales = data.loc[data['store'] == 8].loc[data['item'] == 35]['sales'].tolist()

for i in range(4):
    plt.subplot(2, 2, i+1)
    res = moving_average(sales, n=7*(weeks[i]))
    plt.plot(res)
    
plt.show()

The rolling averages over a week and a month suggest that there are two seasonality scales. The sales seem to vary throughout a week as well as throughout a month.

From the rolling average over three months you can see the trend of overall increasing sales.
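
As an aside, pandas offers the same computation through `Series.rolling(window).mean()`. The sketch below compares it with a copy of the `moving_average` helper on a short made-up series (the values are an assumption chosen for illustration). With an integer window, `rolling` leaves the first `window - 1` entries as NaN, so they are dropped before comparing.

```python
import numpy as np
import pandas as pd

def moving_average(a, n=3):
    """Copy of the notebook's cumulative-sum helper, repeated so this snippet runs alone."""
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

# Made-up values for illustration only
values = pd.Series([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0])
window = 3

# pandas rolling mean; the first window-1 results are NaN and are dropped
rolled = values.rolling(window=window).mean().dropna()
manual = moving_average(values.tolist(), n=window)

print(np.allclose(rolled.to_numpy(), manual))
```

An advantage of the pandas version is that the result keeps the original index, so a date-indexed series can be plotted against dates directly.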

Let's look at the variation within a month in more detail.

### Month scale

For May 2015 we can see the seasonality within a month: sales vary over the course of each week.

In [None]:
data_month = data_year.loc[(data_year['date'] >= '2015-05-01') & (data_year['date'] < '2015-06-01')]
data_month.loc[data_month['store'] == 8].loc[data_month['item'] == 35]['sales'].plot()

In [None]:
# Daily sales of item 39 at store 3 in May 2015, indexed by date
test = data_month.loc[(data_month['store'] == 3) & (data_month['item'] == 39)].set_index('date')['sales']

decomposed = seasonal_decompose(test, model='additive')
x = decomposed.plot()
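
The weekly variation described in this section can also be summarized directly by averaging sales per day of the week. This is a minimal sketch: the helper name `mean_sales_by_weekday` is hypothetical, and the four synthetic weeks (with higher values on weekends) are an assumption used only to make the snippet runnable on its own.

```python
import numpy as np
import pandas as pd

def mean_sales_by_weekday(df):
    """Return average sales per day of week (Monday=0 ... Sunday=6)."""
    return df.groupby(df["date"].dt.dayofweek)["sales"].mean()

# Four synthetic weeks starting on Monday 2015-05-04 (assumption, not real data)
dates = pd.date_range("2015-05-04", periods=28, freq="D")
demo = pd.DataFrame({
    "date": dates,
    "sales": np.tile([10, 11, 12, 13, 14, 20, 22], 4),
})

by_weekday = mean_sales_by_weekday(demo)
print(by_weekday)
```

On the real data, the same helper could be applied to a single store and item slice, or to `data` as a whole, to see which weekdays sell most.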

## Conclusion

The data is complete, with no missing values. There is a trend of increasing sales over the given years. For the store and item combinations inspected, sales increase in the first half of the year and decrease in the second half.
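
Both conclusions could be checked across the whole dataset with two simple aggregations: total sales per year for the trend, and mean sales per calendar month for the yearly pattern. The sketch below is an illustration only; the helper name `yearly_totals_and_monthly_means` is hypothetical, and the synthetic data (rising each year, peaking in July) is an assumption that stands in for `data`.

```python
import pandas as pd

def yearly_totals_and_monthly_means(df):
    """Return (total sales per year, mean sales per calendar month)."""
    yearly_total = df.groupby(df["date"].dt.year)["sales"].sum()
    monthly_mean = df.groupby(df["date"].dt.month)["sales"].mean()
    return yearly_total, monthly_mean

# Synthetic stand-in (assumption): sales peak in July and rise from 2013 to 2014
demo = pd.DataFrame({"date": pd.date_range("2013-01-01", "2014-12-31", freq="D")})
month = demo["date"].dt.month
year = demo["date"].dt.year
demo["sales"] = 20 - (month - 7).abs() + 5 * (year - 2013)

yearly_total, monthly_mean = yearly_totals_and_monthly_means(demo)
print(yearly_total)
print(monthly_mean)
```

Grouping additionally by `store` and `item` would show whether the pattern holds for every combination or only for the ones plotted above.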