# Store Item Demand Forecasting Challenge

<blockquote>The objective of this competition is to predict 3 months of item-level sales data at different store locations.  

This competition is provided as a way to explore different time series techniques on a relatively simple and clean dataset.  

You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.  

What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost?
</blockquote>

See [Kaggle page](https://www.kaggle.com/c/demand-forecasting-kernels-only)

In [1]:
import math
import pandas as pd
import numpy as np
import os

# Print floats in fixed-point rather than scientific notation; exponents make values harder to compare
pd.set_option('float_format', '{:f}'.format)

import seaborn as sns
#%pylab inline
# pylab.rcParams['figure.figsize'] = (15, 6)

import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

In [2]:
'numpy: {}, pandas: {}, sklearn: {}'.format(np.__version__, pd.__version__, sklearn.__version__)

'numpy: 1.14.5, pandas: 0.23.3, sklearn: 0.19.1'

## Download

In [3]:
competition_name = "demand-forecasting-kernels-only"
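# Build the path with os.path.join so backslashes are never read as escape sequences
data_path = os.path.join("..", "datasets", "kaggle", "demand-forecasting")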

In [None]:
if not os.path.isdir(data_path):
    print("creating directory '{}'".format(data_path))
    os.makedirs(data_path)

In [20]:
!kaggle competitions download -c $competition_name -p $data_path

Downloading sample_submission.csv.zip to ..\datasets\kaggle\demand-forecasting

Downloading test.csv.zip to ..\datasets\kaggle\demand-forecasting

Downloading train.csv.zip to ..\datasets\kaggle\demand-forecasting






In [21]:
import zipfile

def extract_zip(filename):
    # Extract every member of the archive into data_path
    print("unzipping {} to {}".format(filename, data_path))
    with zipfile.ZipFile(filename) as archive:
        archive.extractall(data_path)

Unzip all zip files

In [22]:
import glob

for zip_file in glob.glob(os.path.join(data_path, "*.zip")):
    extract_zip(zip_file)

unzipping ..\datasets\kaggle\demand-forecasting\sample_submission.csv.zip to ..\datasets\kaggle\demand-forecasting
unzipping ..\datasets\kaggle\demand-forecasting\test.csv.zip to ..\datasets\kaggle\demand-forecasting
unzipping ..\datasets\kaggle\demand-forecasting\train.csv.zip to ..\datasets\kaggle\demand-forecasting


In [23]:
!DEL /Q "$data_path\*.zip"
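
The `DEL` command above only works in a Windows shell. A portable alternative, sketched here with the standard library only, removes the archives from Python instead; the helper name `remove_zip_files` is hypothetical and not part of the original notebook.

```python
import glob
import os


def remove_zip_files(directory):
    """Delete every *.zip file directly inside `directory` and return the removed paths."""
    removed = []
    for zip_file in glob.glob(os.path.join(directory, "*.zip")):
        os.remove(zip_file)
        removed.append(zip_file)
    return removed
```

Calling `remove_zip_files(data_path)` after extraction should leave only the CSV files in the data directory.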

## Import

In [4]:
def read_data(filename):
    return pd.read_csv(os.path.join(data_path, filename))

In [5]:
train_data = read_data("train.csv")
test_data = read_data("test.csv")

In [6]:
X_train = train_data.drop(["sales"], axis=1)
y_train = train_data["sales"].copy()
X_test = test_data.copy()

## Explore

In [7]:
X_train.head()

         date  store  item
0  2013-01-01      1     1
1  2013-01-02      1     1
2  2013-01-03      1     1
3  2013-01-04      1     1
4  2013-01-05      1     1


In [8]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 3 columns):
date     913000 non-null object
store    913000 non-null int64
item     913000 non-null int64
dtypes: int64(2), object(1)
memory usage: 20.9+ MB
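
The `info()` output shows that `date` is stored as `object`, i.e. plain strings. Below is a minimal sketch of parsing it into `datetime64` and deriving calendar columns that a model could use to capture seasonality. The helper `add_date_features` and the feature names are assumptions for illustration, not something the original notebook defines.

```python
import pandas as pd


def add_date_features(df):
    """Return a copy of `df` with `date` parsed and calendar columns added."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    out["dayofweek"] = out["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["dayofyear"] = out["date"].dt.dayofyear
    return out


# Tiny example frame in the same layout as X_train.
example = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02"],
    "store": [1, 1],
    "item": [1, 1],
})
features = add_date_features(example)
```

The same call could be applied to both `X_train` and `X_test` so that the two frames end up with identical columns.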


In [9]:
X_train.describe()

           store       item
count   913000.0   913000.0
mean         5.5       25.5
std     2.872283  14.430878
min          1.0        1.0
25%          3.0       13.0
50%          5.5       25.5
75%          8.0       38.0
max         10.0       50.0


In [10]:
X_train.shape

(913000, 3)
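
The row count is consistent with the problem statement: 10 stores times 50 items times one row per day over 5 years. The quick check below assumes the data runs from 2013-01-01 (the first date shown by `head()`) through 2017-12-31; the end date is an assumption, since it does not appear in this notebook.

```python
import pandas as pd

n_stores = 10
n_items = 50
# Assumed date range: the start matches X_train.head(); the end date is an assumption.
days = pd.date_range("2013-01-01", "2017-12-31", freq="D")

# One row per (store, item, day) combination.
expected_rows = n_stores * n_items * len(days)
print(len(days), expected_rows)
```

Under that assumption there are 1826 days (including the 2016 leap day), and 10 × 50 × 1826 = 913000, which matches `X_train.shape[0]`.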

## Train

## Evaluate
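
The competition scores submissions with SMAPE (symmetric mean absolute percentage error). Below is a minimal sketch of that metric, following the convention that a term counts as 0 when the actual and predicted values are both 0. The function name `smape` is our own choice, and the result is expressed in percent.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where both values are 0 the term is defined as 0 instead of 0/0.
    ratio = np.divide(diff, denominator, out=np.zeros_like(diff), where=denominator != 0)
    return 100.0 * ratio.mean()
```

A local validation score could then be computed as `smape(y_valid, y_pred)` on a held-out period, where `y_valid` and `y_pred` are hypothetical names for the actual and predicted sales.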

## Submit

In [None]:
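submission = pd.DataFrame({
  "id": X_test["id"],  # assumes test.csv provides the `id` column used by sample_submission.csv
  "sales": y_hat_gb    # predictions from the gradient tree boosting model
})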

In [None]:
submission.to_csv(os.path.join(data_path, "submission.csv"), index=False)

In [None]:
!kaggle competitions submit -c $competition_name -f $data_path/submission.csv -m "Use gradient tree boosting"