# This notebook is derived from the following notebook
https://www.kaggle.com/istnnrhk/demand-prediction-for-multi-store-and-multi-item

Many thanks to istnnrhk! Please upvote that notebook if you liked this notebook!

In [None]:
!pip install git+https://github.com/AutoViML/Auto_TS.git

# Demand prediction for multi-store and multi-item

This kernel is for Kaggle's Store Item Demand Forecasting Challenge

## Data Description
The objective of this competition is to predict 3 months of item-level sales data at different store locations.

File descriptions

* train.csv - Training data
* test.csv - Test data (Note: the Public/Private split is time based)
* sample_submission.csv - a sample submission file in the correct format

Data fields

* date - Date of the sale data. There are no holiday effects or store closures.
* store - Store ID
* item - Item ID
* sales - Number of items sold at a particular store on a particular date.

# Exploratory Data Analysis (Data understanding)

- Quick viewing of given raw data before importing
- First glance at given data set
 - Check shape of data, columns, index
 - Viewing raw data
 - Check NaN
 - Check describe
- Pivotal analysis
- Check ECDF: empirical cumulative distribution function
- Check histogram
- Check trend
- Check timeseries plot
- Conclusion of EDA

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Quick viewing of given raw data before importing

### Result of quick viewing
- The files have a header row
- train.csv has four columns: date, store, item and sales
- test.csv also has four columns, with an id column in place of sales
- sample_submission.csv has two columns: id and sales
- test.csv and sample_submission.csv have the same number of rows
- So test.csv likely plays the role of test_X, and the sales column of sample_submission.csv the role of test_y
- Training period: 2013-01-01 to 2017-12-31 (5 years)
- Test period: 2018-01-01 to 2018-03-31 (3 months)
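
The periods above imply a fixed number of rows if every store-item pair has exactly one row per day. A small sketch of that arithmetic, assuming the 10 stores and 50 items reported in the EDA conclusion (the helper name `expected_rows` is only for illustration); comparing the result with `train.shape` and `test.shape` is a quick completeness check:

```python
import pandas as pd

N_STORES, N_ITEMS = 10, 50  # counts reported in the EDA conclusion

def expected_rows(start, end, n_series):
    """Rows in a complete daily panel: one row per series per calendar day."""
    n_days = len(pd.date_range(start, end, freq="D"))
    return n_days * n_series

n_series = N_STORES * N_ITEMS
train_rows = expected_rows("2013-01-01", "2017-12-31", n_series)
test_rows = expected_rows("2018-01-01", "2018-03-31", n_series)
print(train_rows, test_rows)
```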

# First glance at given data set
In this section we go through the given data and check it for missing values.

In [None]:
# import related libraries

# dates
from datetime import datetime  # recent pandas versions no longer export datetime

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns # advanced vizs
%matplotlib inline

# statistics
from statsmodels.distributions.empirical_distribution import ECDF

# time series analysis
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
pd.set_option("display.max_rows", 20)

# prophet by Facebook (older releases were published as fbprophet)
try:
    from prophet import Prophet
except ImportError:
    from fbprophet import Prophet

In [None]:
# Import data
train_data_csv = "../input/train.csv"
test_data_csv = "../input/test.csv"
sample_submission_csv = "../input/sample_submission.csv"

train = pd.read_csv(train_data_csv, parse_dates = True,
                    low_memory = False, index_col = 'date')
test = pd.read_csv(test_data_csv, parse_dates = True,
                   low_memory = False, index_col = 'date')
submission = pd.read_csv(sample_submission_csv)
print("Check imported data")
print()
print("In total:")
print("train.shape {} ".format(train.shape))
print("test.shape {} ".format(test.shape))
print("submission.shape {} ".format(submission.shape))
print()
print("train.columns {} ".format(train.columns))
print("test.columns {} ".format(test.columns))
print("submission.columns {} ".format(submission.columns))
print()
print("train.index {} ".format(train.index))
print("test.index {} ".format(test.index))
print("submission.index {} ".format(submission.index))

### Viewing raw data
Looking at the first rows is important: it confirms that the columns and the date index were parsed as expected.

In [None]:
train.head()

### Check NaN

In [None]:
# rows in train that contain at least one NaN value
print(train[train.isna().any(axis=1)].shape)
# rows in test that contain at least one NaN value
print(test[test.isna().any(axis=1)].shape)

## ECDF: empirical cumulative distribution function

In [None]:
sns.set(style = "ticks")  # use seaborn's "ticks" style
c = '#386B7F'  # base color for plots
plt.figure(figsize = (12, 13))

# ECDF of the store IDs
plt.subplot(311)
cdf = ECDF(train['store'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('store'); plt.ylabel('ECDF');

# ECDF of the item IDs
plt.subplot(312)
cdf = ECDF(train['item'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('item'); plt.ylabel('ECDF');

# ECDF of the daily sales values
plt.subplot(313)
cdf = ECDF(train['sales'])
plt.plot(cdf.x, cdf.y, label = "statsmodels", color = c);
plt.xlabel('sales'); plt.ylabel('ECDF');
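
For readers unfamiliar with the ECDF, the quantity plotted above can be reproduced by hand: sort the values and assign each one its rank divided by the sample size. A minimal sketch with NumPy (the function name `ecdf` and the toy input are assumptions for illustration, not part of statsmodels):

```python
import numpy as np

def ecdf(values):
    """Return the sorted values x and y[k] = (k + 1) / n, the rank of x[k] over the sample size."""
    x = np.sort(np.asarray(values))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# For tied values, the largest y among the ties is the ECDF value at that point.
x, y = ecdf([3, 1, 2, 2])
print(x, y)
```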



## Conclusion of EDA

- 10 different stores and 50 different items
- Training period: 2013-01-01 to 2017-12-31
- Test period: 2018-01-01 to 2018-03-31
- No missing data
- The given data (sales per store and per item) are stacked into a single column
- Sales increase year over year
- Monday has the lowest sales and Sunday the highest
- Sales are increasing at most stores
- Sales are increasing for most items
- Sales at month end are higher than on other days
- Sales in summer are higher than in other seasons
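
Statements such as the weekday and seasonal effects can be checked by averaging sales over a calendar part of the date index. A minimal sketch on a toy frame that mimics the layout of `train` (the helper name `mean_sales_by` and the toy data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def mean_sales_by(df, part):
    """Average sales grouped by a calendar part of the DatetimeIndex.

    part is any DatetimeIndex attribute such as "dayofweek", "month" or "year".
    """
    keys = getattr(df.index, part)
    return df["sales"].groupby(keys).mean()

# Toy stand-in for train: one series, two weeks, higher sales on Sundays.
idx = pd.date_range("2013-01-07", periods=14, freq="D", name="date")  # starts on a Monday
sales = np.where(idx.dayofweek == 6, 30, 10)
toy = pd.DataFrame({"store": 1, "item": 1, "sales": sales}, index=idx)

by_weekday = mean_sales_by(toy, "dayofweek")  # 0 = Monday ... 6 = Sunday
print(by_weekday)
```

On the real data, `mean_sales_by(train, "dayofweek")`, `mean_sales_by(train, "month")` and `mean_sales_by(train, "year")` would give the corresponding summaries.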

# Modeling approach (comparing GBT vs Auto_TS)

## First impression of result of EDA
- There are 10 different stores and 50 different items, so we have to predict 500 different values for each day. There are two approaches: build 500 separate models, one per store-item pair, or build a single model that predicts all 500 sales values.
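
The two approaches can be sketched side by side on a toy panel. This is only an illustration under assumed data (2 stores, 3 items, a `dayofweek` feature and random sales), not the setup used later in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-in for train: 2 stores x 3 items x 30 days (assumed data).
rng = np.random.default_rng(0)
dates = pd.date_range("2013-01-01", periods=30, freq="D")
toy = pd.DataFrame(
    [(d, s, i) for d in dates for s in (1, 2) for i in (1, 2, 3)],
    columns=["date", "store", "item"],
)
toy["dayofweek"] = toy["date"].dt.dayofweek
toy["sales"] = 10 * toy["store"] + toy["item"] + rng.poisson(2, len(toy))

# Approach 1: one model per (store, item) pair.
per_series = {
    key: GradientBoostingRegressor(n_estimators=20).fit(g[["dayofweek"]], g["sales"])
    for key, g in toy.groupby(["store", "item"])
}

# Approach 2: a single model that takes store and item as features.
features = ["store", "item", "dayofweek"]
global_model = GradientBoostingRegressor(n_estimators=20).fit(toy[features], toy["sales"])

print(len(per_series), "per-series models vs 1 global model")
```

With the real data the first approach would train 500 models, while the second trains one model on all rows.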

## Gradient Boosting Decision Tree (GBDT)
- This is a good baseline model in competitions.
- Fortunately, decision-tree models can handle this kind of stacked data.
- However, a decision tree does not estimate regression coefficients the way linear regression does, so it cannot extrapolate a trend beyond the range of the training data. It is therefore necessary to detrend the time series. (Below, detrending is not yet applied.)
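
To illustrate the last point, here is a hedged sketch of one possible detrending scheme on a toy series: fit a straight line with `np.polyfit`, let a tree model the residual, and add the line back at prediction time. The linear trend and the toy data are assumptions; this is not applied anywhere in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy daily series (assumed data): linear trend plus a repeating weekly pattern.
t = np.arange(200)
dow = t % 7
y = 0.5 * t + dow

train, test = slice(0, 150), slice(150, 200)
X = np.column_stack([t, dow])

# Plain tree on the raw target: predictions cannot exceed the largest training value.
plain = DecisionTreeRegressor(random_state=0).fit(X[train], y[train])
plain_pred = plain.predict(X[test])

# Detrended variant: fit a line, model the residual with a tree, add the line back.
slope, intercept = np.polyfit(t[train], y[train], deg=1)
trend = slope * t + intercept
resid_tree = DecisionTreeRegressor(random_state=0).fit(
    dow[train].reshape(-1, 1), y[train] - trend[train]
)
detrended_pred = trend[test] + resid_tree.predict(dow[test].reshape(-1, 1))

plain_mae = np.mean(np.abs(y[test] - plain_pred))
detrended_mae = np.mean(np.abs(y[test] - detrended_pred))
print(f"plain MAE={plain_mae:.2f}, detrended MAE={detrended_mae:.2f}")
```

The plain tree stays flat after the end of the training window, while the detrended variant keeps following the line.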


### Let's compare Auto_TS to a simple GBT Model

In [None]:
#!pip install auto-ts

In [None]:
train_X = train.copy(deep=True)
del train_X['sales']
train_y = train['sales']

In [None]:
trainx = train.reset_index()
trainx.head()

In [None]:
test_X = test.copy(deep=True)
del test_X['id']
test_X.columns

In [None]:
testx = test.drop('id',axis=1)
testx = testx.reset_index()
testx.head()

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(n_estimators=100, learning_rate=0.25,
        max_depth=1)
clf.fit(train_X, train_y)
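
Because `date` is the index, `train_X` holds only `store` and `item`, so the model above sees no calendar information. A minimal sketch of how calendar features could be derived from the index before fitting (the helper name `add_calendar_features` and the chosen features are assumptions, shown on a toy frame):

```python
import pandas as pd

def add_calendar_features(df):
    """Return a copy of df with features derived from its DatetimeIndex."""
    out = df.copy()
    out["year"] = out.index.year
    out["month"] = out.index.month
    out["dayofweek"] = out.index.dayofweek  # 0 = Monday ... 6 = Sunday
    out["is_month_end"] = out.index.is_month_end.astype(int)
    return out

# Toy frame with the same layout as train_X: date index, store and item columns.
idx = pd.date_range("2013-01-30", periods=3, freq="D", name="date")
toy_X = pd.DataFrame({"store": 1, "item": 1}, index=idx)
feats = add_calendar_features(toy_X)
print(feats)
```

The same function would have to be applied to both `train_X` and `test_X` so that the two frames keep identical columns.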

In [None]:
pred_y = clf.predict(test_X)
print("Predict", pred_y)

In [None]:
from auto_ts import auto_timeseries

In [None]:
model = auto_timeseries(score_type='rmse',forecast_period=100,
                time_interval='D',
                non_seasonal_pdq=None, seasonality=False, seasonal_period=1,
                model_type=['ML'],
                verbose=2)

In [None]:
ts_column = 'date'
target = 'sales'

In [None]:
model.fit(train, ts_column,target)

In [None]:
predictions = model.predict(
            testdata=testx,
            model='ML',
        )

In [None]:
pred_x = predictions['yhat'].values
pred_x

In [None]:
# Write the submission file for the Auto_TS predictions
out_df = pd.DataFrame({'id': test['id'].astype(np.int32), 'sales': pred_x})
out_df.to_csv('submission.csv', index=False)

In [None]:
# Write the submission file for the GBT predictions
out_df = pd.DataFrame({'id': test['id'].astype(np.int32), 'sales': pred_y})
out_df.to_csv('submissiony.csv', index=False)
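
The two submission files can only be compared on the leaderboard. For a local comparison, one option is to hold out the last days of the training period and score both models with RMSE, the metric named in `score_type='rmse'`. A minimal sketch, assuming a date-indexed frame like `train` (the helper names `time_holdout` and `rmse` and the 90-day window are assumptions):

```python
import numpy as np
import pandas as pd

def time_holdout(df, n_days):
    """Split a date-indexed frame into (earlier rows, rows of the last n_days days)."""
    cutoff = df.index.max() - pd.Timedelta(days=n_days)
    return df[df.index <= cutoff], df[df.index > cutoff]

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy stand-in for train: one year of daily rows.
idx = pd.date_range("2017-01-01", "2017-12-31", freq="D", name="date")
toy = pd.DataFrame({"sales": np.arange(len(idx))}, index=idx)
fit_part, valid_part = time_holdout(toy, n_days=90)
print(len(fit_part), len(valid_part), rmse([1, 2, 3], [1, 2, 5]))
```

Each model would then be fitted on the earlier part and scored with `rmse` on the held-out part.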

References:
- https://www.kaggle.com/elenapetrova/time-series-analysis-and-forecasts-with-prophet
- https://petolau.github.io/Regression-trees-for-forecasting-time-series-in-R/
