# Demand prediction for multi-store and multi-item

This kernel is for Kaggle's Store Item Demand Forecasting Challenge.

## Data Description
The objective of this competition is to predict 3 months of item-level sales data at different store locations.

File descriptions

* train.csv - Training data
* test.csv - Test data (Note: the Public/Private split is time based)
* sample_submission.csv - a sample submission file in the correct format

Data fields

* date - Date of the sale data. There are no holiday effects or store closures.
* store - Store ID
* item - Item ID
* sales - Number of items sold at a particular store on a particular date.

# Exploratory Data Analysis (Data understanding)

- Quick viewing of given raw data before importing
- First glance at given data set
  - Check shape of data, columns, index
  - Viewing raw data
  - Check NaN
  - Check describe
- Pivot table analysis
- Check ECDF: empirical cumulative distribution function
- Check histogram
- Check trend
- Check timeseries plot
- Conclusion of EDA

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Quick viewing of given raw data before importing

In [None]:
!echo "Quick viewing of given raw data"
!echo "## train.csv ## " ; head ../input/train.csv; echo "..." ; tail ../input/train.csv ; wc -l ../input/train.csv ; echo
!echo "## test.csv ##"   ; head ../input/test.csv ; echo "..." ; tail ../input/test.csv ; wc -l ../input/test.csv  ; echo
!echo "## sample_submission.csv ##";head ../input/sample_submission.csv;echo "..." ; tail ../input/sample_submission.csv ;wc -l ../input/sample_submission.csv

### Result of quick viewing
- All files have a header row.
- train.csv has four columns: date, store, item and sales.
- test.csv also has four columns, with an id column in place of sales.
- sample_submission.csv has two columns: id and sales.
- test.csv and sample_submission.csv have the same number of rows.
- test.csv presumably plays the role of test_X, and the sales column of sample_submission.csv is a placeholder for test_y.
- Training period: 2013-01-01 to 2017-12-31 (5 years)
- Test period: 2018-01-01 to 2018-03-31 (3 months)
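
The same observations can be collected from Python instead of the shell. The following is only a minimal sketch: it reads a tiny in-memory CSV whose header imitates train.csv, and the three rows are made up for illustration. With the real file, `toy_csv` would be replaced by the path to train.csv.

```python
import io

import pandas as pd

# Made-up rows that only imitate the layout of train.csv.
toy_csv = io.StringIO(
    "date,store,item,sales\n"
    "2013-01-01,1,1,13\n"
    "2013-01-02,1,1,11\n"
    "2013-01-01,2,1,12\n"
)

df = pd.read_csv(toy_csv, parse_dates=["date"])

# Collect the facts that the shell commands above were used to eyeball.
summary = {
    "columns": list(df.columns),
    "n_rows": len(df),
    "first_date": df["date"].min(),
    "last_date": df["date"].max(),
    "n_stores": df["store"].nunique(),
    "n_items": df["item"].nunique(),
}
print(summary)
```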

# First glance at given data set
In this section we load the given data, look at its shape and contents, and check for missing values.

In [None]:
# import related libraries

# dates (pandas.datetime was removed in pandas 2.0; use the standard library)
from datetime import datetime

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns # advanced vizs
%matplotlib inline

# statistics
from statsmodels.distributions.empirical_distribution import ECDF

# time series analysis
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

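# prophet by Facebook (the package was called fbprophet before version 1.0)
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet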

In [None]:
# Import data
train_data_csv = "../input/train.csv"
test_data_csv = "../input/test.csv"
sample_submission_csv = "../input/sample_submission.csv"

train = pd.read_csv(train_data_csv, parse_dates = True,
                    low_memory = False, index_col = 'date')
test = pd.read_csv(test_data_csv, parse_dates = True,
                   low_memory = False, index_col = 'date')
submission = pd.read_csv(sample_submission_csv)

### Check shape of data, columns, index

In [None]:
print("Check imported data")
print()
print("In total:")
print("train.shape {} ".format(train.shape))
print("test.shape {} ".format(test.shape))
print("submission.shape {} ".format(submission.shape))
print()
print("train.columns {} ".format(train.columns))
print("test.columns {} ".format(test.columns))
print("submission.columns {} ".format(submission.columns))
print()
print("train.index {} ".format(train.index))
print("test.index {} ".format(test.index))
print("submission.index {} ".format(submission.index))



### Viewing raw data
Looking at the actual rows is an important sanity check before any further analysis.

In [None]:
pd.set_option("display.max_rows", 20)

In [None]:
train.head(500)

In [None]:
test.head(500)

In [None]:
submission.head(500)

### Check NaN

In [None]:
# rows that contain at least one NA value
train[train.isna().any(axis=1)]

In [None]:
# rows that contain at least one NA value
test[test.isna().any(axis=1)]

In [None]:
# rows that contain at least one NA value
submission[submission.isna().any(axis=1)]

The train, test and submission data do not contain any NA values.
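
A more compact check is to count missing values per column rather than listing the affected rows. This is a minimal sketch on a made-up frame (the `toy` data, including its one deliberately missing value, is an assumption for illustration); on the real data the same two lines would be applied to `train`, `test` and `submission`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing sales value.
toy = pd.DataFrame({
    "store": [1, 1, 2],
    "item": [1, 2, 1],
    "sales": [13.0, np.nan, 12.0],
})

# Number of missing values in each column, and in the whole frame.
na_per_column = toy.isna().sum()
total_na = int(na_per_column.sum())

print(na_per_column)
print("total missing:", total_na)
```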

### Check describe

In [None]:
# describe - note: store and item are categorical IDs, so their statistics are not meaningful
train.describe()

The minimum sales value is 0.
The distribution of sales values therefore needs to be checked.

In [None]:
# describe - note: store and item are categorical IDs, so their statistics are not meaningful
test.describe()

In [None]:
# describe - note: this submission data is only a sample
submission.describe()

Every sales value in the sample submission is 52.

## Pivot table analysis

In [None]:
pd.set_option("display.precision", 1)

In [None]:
# Pivot
pd.pivot_table(train, index='item', columns='store', aggfunc='count')

In [None]:
pd.pivot_table(train, index='item', columns='store', aggfunc='min')

In [None]:
pd.pivot_table(train, index='item', columns='store', aggfunc='max')

In [None]:
pd.pivot_table(train, index='item', columns='store', aggfunc='median')
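
The pivot tables above aggregate every remaining column, which works while `sales` is the only one left. Once extra columns exist, naming the value column explicitly keeps the table to sales only. A minimal sketch on made-up data (the `toy` frame and its `Year` column are assumptions for illustration):

```python
import pandas as pd

# Made-up data; "Year" stands in for any extra column that should not be aggregated.
toy = pd.DataFrame({
    "store": [1, 1, 2, 2, 1, 2],
    "item":  [1, 2, 1, 2, 1, 1],
    "sales": [10, 20, 30, 40, 14, 34],
    "Year":  [2013] * 6,
})

# values="sales" restricts the aggregation to the sales column,
# giving one row per item and one column per store.
median_sales = pd.pivot_table(
    toy, values="sales", index="item", columns="store", aggfunc="median"
)
print(median_sales)
```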

## ECDF: empirical cumulative distribution function

In [None]:
sns.set(style = "ticks") # use seaborn's "ticks" style
c = '#386B7F' # basic color for plots
plt.figure(figsize = (12, 13))

plt.subplot(311)
cdf = ECDF(train['store'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('store'); plt.ylabel('ECDF');

plt.subplot(312)
cdf = ECDF(train['item'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('item'); plt.ylabel('ECDF');

plt.subplot(313)
cdf = ECDF(train['sales'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('sales'); plt.ylabel('ECDF');
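
For reference, an ECDF can also be computed by hand, which makes clear what the curves show: after sorting, the i-th observation sits at height i / n. This is a minimal sketch with a hypothetical helper named `ecdf`; it is not the statsmodels implementation, only the same idea.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values x and step heights y, where y[i] = (i + 1) / n."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Tiny made-up sample; the curve reaches 1.0 at the largest value.
x, y = ecdf([3, 1, 2, 2])
print(x)
print(y)
```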



## Check histogram

In [None]:
train['store'].hist()

In [None]:
train['item'].hist()

In [None]:
train['sales'].hist()

In [None]:
# check small sales values
train[train['sales'] < 2]

Only one row has a sales value of 0: store ID 6, item ID 4.

## Creating new features for further analysis

In [None]:
# data extraction
train['Year'] = train.index.year
train['Month'] = train.index.month
train['Day'] = train.index.day
# DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
train['WeekOfYear'] = train.index.isocalendar().week.astype('int64').to_numpy()
# NOTE: despite its name, this column holds the day of week (0 = Monday, 6 = Sunday)
train['DayOfYear'] = train.index.dayofweek
train['is_month_start'] = train.index.is_month_start
train['is_month_end'] = train.index.is_month_end
train['days_from_epoch'] = (train.index - pd.Timestamp("1970-01-01")).days
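
Because the same features are needed for both train and test, one option is to wrap the extraction in a function. The sketch below is only an illustration: `add_date_features` is a hypothetical helper, and its `DayOfWeek` / `DayOfYear` column names are assumptions that differ from the columns used in this notebook.

```python
import pandas as pd

def add_date_features(df):
    """Return a copy of df (indexed by a DatetimeIndex) with calendar features added."""
    out = df.copy()
    idx = out.index
    out["Year"] = idx.year
    out["Month"] = idx.month
    out["Day"] = idx.day
    out["DayOfWeek"] = idx.dayofweek  # 0 = Monday, 6 = Sunday
    out["DayOfYear"] = idx.dayofyear  # 1 = January 1st
    out["is_month_start"] = idx.is_month_start
    out["is_month_end"] = idx.is_month_end
    out["days_from_epoch"] = (idx - pd.Timestamp("1970-01-01")).days
    return out

# Made-up two-row frame indexed by date.
toy = pd.DataFrame(
    {"sales": [13, 11]},
    index=pd.to_datetime(["2013-01-01", "2013-01-31"]),
)
features = add_date_features(toy)
print(features)
```

Calling the same function on both frames keeps the feature columns identical, which is what the model expects at prediction time.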

## Check trend

In [None]:
# sales trends
sns.catplot(data = train, x = 'Year', y = "sales", kind='point')

Sales appear to increase year over year.

In [None]:
# sales trends
# sns.factorplot(data = train, x = 'Month', y = "sales")
sns.catplot(data = train, x = 'Month', y = "sales", kind='point')

Items sell best in summer.

In [None]:
# sales trends
# sns.factorplot(data = train, x = 'Day', y = "sales")
sns.catplot(data = train, x = 'Day', y = "sales", kind='point')

Sales tend to be higher at the end of the month.

In [None]:
# sales trends, for each store
sns.catplot(data = train, x = 'Year', y = "sales", col='store', kind='point')

In [None]:
# sales trends, for each item
sns.catplot(data = train, x = 'Year', y = "sales", row='item', kind='point')

In [None]:
# sales trends, for each store x item
sns.catplot(data = train, x = 'Year', y = "sales",
            row = 'item', col='store', kind='point')

### Result of the trend plots
- Sales values are increasing.
- Each item and each store has its own rate of increase.
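
The individual rates of increase can be put into numbers by comparing mean sales from one year to the next. A minimal sketch on made-up data (the `toy` values are assumptions for illustration and do not come from train.csv):

```python
import pandas as pd

# Made-up yearly sales for two stores.
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "Year":  [2013, 2014, 2015] * 2,
    "sales": [100.0, 110.0, 121.0, 50.0, 60.0, 72.0],
})

# Mean sales per store and year: one row per store, one column per year.
yearly = toy.groupby(["store", "Year"])["sales"].mean().unstack("Year")

# Relative change against the previous year (the first year has no predecessor, so it is NaN).
growth = yearly / yearly.shift(1, axis=1) - 1
print(growth)
```

Grouping by `item`, or by both `store` and `item`, gives the corresponding rates per item or per store-item pair.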

### Check timeseries plot

In [None]:
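# timeseries plot
from scipy import signal

def tsplot(tsdf, title):
    """Plot the sales of one store-item series together with its linear trend."""
    t = tsdf.index
    y = tsdf['sales'].to_numpy(dtype=float)
    # signal.detrend removes a least-squares linear fit, so the original
    # series minus the detrended series is the fitted trend line
    trend = y - signal.detrend(y)
    plt.figure(figsize=(4, 3))
    plt.plot(t, y, label="Original Data")
    plt.plot(t, trend, "--r", label="Trend")
    plt.axis("tight")
    plt.legend(loc=0)
    plt.title(title)
    plt.show()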

In [None]:
for s in train['store'].unique():
    tmpdf = train[train['store']==s]
    # Only items 1 and 2 are plotted to keep the output short;
    # iterate over tmpdf['item'].unique() to plot every item.
    for i in range(1,3):
        tmp2df = tmpdf[tmpdf['item']==i]
        tsplot(tmp2df, "store ID {} and item ID {}".format(s,i))
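
The trend line drawn by `tsplot` is an ordinary least-squares line: with its default settings `scipy.signal.detrend` subtracts a linear fit over the sample positions. The sketch below checks this against `np.polyfit` on a made-up series (slope, intercept and noise are assumptions for illustration). It relies on the samples being equally spaced, which holds for daily data without gaps.

```python
import numpy as np
from scipy import signal

# Made-up series: a linear trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(100)
y = 20 + 0.05 * t + rng.normal(scale=1.0, size=t.size)

# Trend as computed inside tsplot: original minus detrended series.
trend_scipy = y - signal.detrend(y)

# The same trend from an explicit degree-1 least-squares fit.
slope, intercept = np.polyfit(t, y, deg=1)
trend_polyfit = intercept + slope * t

print(np.allclose(trend_scipy, trend_polyfit))
```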

## Conclusion of EDA

- There are 10 different stores and 50 different items.
- Training period: 2013-01-01 to 2017-12-31
- Test period: 2018-01-01 to 2018-03-31
- There is no missing data.
- Sales for all store and item combinations are stacked into a single column.
- Sales are increasing year by year.
- Monday has the lowest sales and Sunday the highest.
- Sales are increasing at most stores.
- Sales are increasing for most items.
- Sales at month end are higher than on other days.
- Sales in summer are higher than in other seasons.
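
The day-of-week statement can be checked with a group-by on the weekday of the date index. The sketch below only shows the mechanics: the `toy` sales are made up so that they rise through the week, and they say nothing about the real data. On the real data the same group-by would be applied to `train`.

```python
import pandas as pd

# Four full weeks of made-up daily sales that rise from Monday to Sunday.
dates = pd.date_range("2013-01-01", periods=28, freq="D")
toy = pd.DataFrame({"sales": 10 + dates.dayofweek.to_numpy()}, index=dates)

# Mean sales per weekday (0 = Monday, 6 = Sunday), relabelled with names.
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
mean_by_dow = toy.groupby(toy.index.dayofweek)["sales"].mean()
mean_by_dow.index = [day_names[d] for d in mean_by_dow.index]

print(mean_by_dow)
print("lowest:", mean_by_dow.idxmin(), "highest:", mean_by_dow.idxmax())
```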

# Modeling approach (my baseline)

## First impressions from the EDA
- There are 10 different stores and 50 different items, so we have to predict 500 different values for each day. There are two approaches. One is to build 500 separate models, one per store-item series. The other is to build a single model that predicts all 500 sales values.

## Gradient Boosting Decision Tree (GBDT)
- This is a good baseline model in competitions.
- Fortunately, decision-tree models can handle this kind of data.
- However, a decision tree does not estimate regression coefficients the way linear regression does. Its predictions are built from the target values seen in training, so it cannot extrapolate a trend beyond them. It is therefore necessary to detrend the time series first. (Detrending is not yet applied below.)
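
The two approaches can be sketched side by side. This is only an illustration under stated assumptions: the `toy` data (2 stores, 2 items, 60 days, made-up sales) and the model settings are not taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Made-up data: 2 stores x 2 items x 60 days.
rng = np.random.default_rng(0)
rows = []
for store in (1, 2):
    for item in (1, 2):
        for day in range(60):
            rows.append((store, item, day, 10 * store + 5 * item + rng.normal()))
toy = pd.DataFrame(rows, columns=["store", "item", "day", "sales"])

# Approach 1: one model per (store, item) series, using only time features.
per_series = {
    key: GradientBoostingRegressor(n_estimators=20, random_state=0).fit(
        group[["day"]], group["sales"]
    )
    for key, group in toy.groupby(["store", "item"])
}

# Approach 2: a single model that receives store and item as features.
global_model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(
    toy[["store", "item", "day"]], toy["sales"]
)

print(len(per_series), "per-series models; 1 global model")
```

The per-series approach multiplies the number of models to train and store, while the single model has to learn the differences between series from the ID columns.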
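
The limitation, and one possible way around it, can be shown on a noise-free made-up series. In this sketch a linear trend is fitted with `np.polyfit`, the trees are trained on the residuals, and the line is added back at prediction time. This is an assumed detrending scheme for illustration, not something the notebook applies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Made-up series: a pure linear trend, observed for t = 0..99.
t_train = np.arange(100).reshape(-1, 1)
t_future = np.arange(100, 130).reshape(-1, 1)
y_train = 2.0 * t_train.ravel() + 5.0

# Trees alone: every future point falls into the same leaves, so the forecast is flat.
gbr = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(t_train, y_train)
plain_pred = gbr.predict(t_future)

# Detrending: fit a line, model the residuals with trees, add the line back.
slope, intercept = np.polyfit(t_train.ravel(), y_train, deg=1)
resid = y_train - (intercept + slope * t_train.ravel())
gbr_resid = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(t_train, resid)
detrended_pred = intercept + slope * t_future.ravel() + gbr_resid.predict(t_future)

print("plain forecast at t=129:    ", plain_pred[-1])
print("detrended forecast at t=129:", detrended_pred[-1])
```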


### Data preparation

In [None]:
train.columns

In [None]:
train_X = train.copy(deep=True)
del train_X['sales']
train_y = train['sales']

In [None]:
# data extraction (same features as for train)
test['Year'] = test.index.year
test['Month'] = test.index.month
test['Day'] = test.index.day
# DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
test['WeekOfYear'] = test.index.isocalendar().week.astype('int64').to_numpy()
# NOTE: despite its name, this column holds the day of week (0 = Monday, 6 = Sunday)
test['DayOfYear'] = test.index.dayofweek
test['is_month_start'] = test.index.is_month_start
test['is_month_end'] = test.index.is_month_end
test['days_from_epoch'] = (test.index - pd.Timestamp("1970-01-01")).days

In [None]:
test_X = test.copy(deep=True)
del test_X['id']
test_X.columns

In [None]:
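from sklearn.ensemble import GradientBoostingRegressor

# Baseline: 100 boosted decision stumps (max_depth=1) trained on
# store, item and the calendar features created above
clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.25,
        max_depth=1).fit(train_X, train_y)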
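
The model above is fitted on all training data, so nothing is left to estimate its error before submitting. One option is a time-based hold-out scored with SMAPE, which to my understanding is the metric of this competition. The sketch below is an assumption-laden illustration: `smape` is a hypothetical helper, the `toy` frame is made up, and using the last three months of 2017 as the validation period is my own choice.

```python
import numpy as np
import pandas as pd

def smape(actual, forecast):
    """Symmetric mean absolute percentage error in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    # A term counts as 0 when actual and forecast are both 0.
    ratio = np.divide(
        2.0 * np.abs(forecast - actual),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return 100.0 * ratio.mean()

# Made-up daily frame for 2017; split by date, never at random.
dates = pd.date_range("2017-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"sales": np.arange(len(dates))}, index=dates)
fit_part = toy.loc[:"2017-09-30"]
valid_part = toy.loc["2017-10-01":]

print(len(fit_part), len(valid_part))
print(smape([100, 100], [100, 50]))
```

A model would then be fitted on `fit_part` only, and `smape` applied to its predictions for `valid_part`.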

In [None]:
pred_y = clf.predict(test_X)

In [None]:
print("Predict", pred_y)

In [None]:
# Write submission file
out_df = pd.DataFrame({'id': test['id'].astype(np.int32), 'sales': pred_y})
out_df.to_csv('submission.csv', index=False)
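
Before uploading, the output can be compared against the sample submission. The sketch below uses a hypothetical helper, `check_submission`, and made-up frames. The specific checks (matching columns and row count, no missing or negative sales) are my own assumptions about what a valid file looks like.

```python
import pandas as pd

def check_submission(out_df, sample_df):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if list(out_df.columns) != list(sample_df.columns):
        problems.append("column names differ from the sample")
    if len(out_df) != len(sample_df):
        problems.append("row count differs from the sample")
    if out_df["sales"].isna().any():
        problems.append("sales contains missing values")
    if (out_df["sales"] < 0).any():
        problems.append("sales contains negative values")
    return problems

# Made-up frames standing in for sample_submission.csv and two candidate outputs.
sample = pd.DataFrame({"id": [0, 1, 2], "sales": [52, 52, 52]})
good = pd.DataFrame({"id": [0, 1, 2], "sales": [12.4, 15.0, 9.8]})
bad = pd.DataFrame({"id": [0, 1], "sales": [12.4, -3.0]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```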

References:
- https://www.kaggle.com/elenapetrova/time-series-analysis-and-forecasts-with-prophet
- https://petolau.github.io/Regression-trees-for-forecasting-time-series-in-R/
