**Table of contents**

* [Introduction](#Introduction)
* [Preparation](#Preparation)
  * [Dependencies](#Dependencies)
  * [Load the datasets](#Load-the-datasets)
* [ARIMA](#ARIMA)
* [Time series data exploration](#Time-series-data-exploration)
  * [Distribution of sales](#Distribution-of-sales)
  * [How does sales vary across stores](#How-does-sales-vary-across-stores)
  * [How does sales vary across items](#How-does-sales-vary-across-items)
  * [Time-series visualization of the sales](#Time-series-visualization-of-the-sales)

# Introduction

Kernel for the [demand forecasting](https://www.kaggle.com/c/demand-forecasting-kernels-only) Kaggle competition.

This kernel tries to answer some of the questions posed by the competition:

* What's the best way to deal with seasonality?
* Should stores be modeled separately, or can you pool them together?
* Does deep learning work better than ARIMA?
* Can either beat xgboost?



  
# Preparation

## Dependencies

In [78]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
sns.set()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import statsmodels.api as sm
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split

import warnings

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")



## Load the datasets

In [79]:
# Input data files are available in the "../input/" directory.
# Load each dataset into its own DataFrame.
def load_data(datapath):
    """Read a CSV file, print its shape, and show a random sample of rows."""
    data = pd.read_csv(datapath)
    # Dimensions (rows, columns)
    print('Shape:', data.shape)
    # The features are date, store, and item; train.csv also holds the sales target
    display(data.sample(10))
    return data
    
    
train_df = load_data('../input/demand-forecasting-kernels-only/train.csv')
test_df = load_data('../input/demand-forecasting-kernels-only/test.csv')
sample_df = load_data('../input/demand-forecasting-kernels-only/sample_submission.csv')

Shape: (913000, 4)


|  | date | store | item | sales |
|---|---|---|---|---|
| 56400 | 2017-06-09 | 1 | 4 | 25 |
| 377070 | 2015-07-04 | 7 | 21 | 37 |
| 692067 | 2013-01-14 | 10 | 38 | 39 |
| 803216 | 2017-05-22 | 10 | 44 | 41 |
| 657497 | 2013-05-18 | 1 | 37 | 29 |
| 45851 | 2013-07-21 | 6 | 3 | 38 |
| 846239 | 2015-03-13 | 4 | 47 | 18 |
| 341447 | 2017-12-17 | 7 | 19 | 33 |
| 196126 | 2015-01-15 | 8 | 11 | 63 |
| 128020 | 2013-07-20 | 1 | 8 | 78 |


Shape: (45000, 4)


|  | id | date | store | item |
|---|---|---|---|---|
| 10629 | 10629 | 2018-01-10 | 9 | 12 |
| 36133 | 36133 | 2018-02-13 | 2 | 41 |
| 22203 | 22203 | 2018-03-05 | 7 | 25 |
| 33578 | 33578 | 2018-01-09 | 4 | 38 |
| 6563 | 6563 | 2018-03-25 | 3 | 8 |
| 10984 | 10984 | 2018-01-05 | 3 | 13 |
| 9367 | 9367 | 2018-01-08 | 5 | 11 |
| 20890 | 20890 | 2018-01-11 | 3 | 24 |
| 33238 | 33238 | 2018-01-29 | 10 | 37 |
| 2714 | 2714 | 2018-01-15 | 1 | 4 |


Shape: (45000, 2)


|  | id | sales |
|---|---|---|
| 32930 | 32930 | 52 |
| 30838 | 30838 | 52 |
| 35820 | 35820 | 52 |
| 1933 | 1933 | 52 |
| 26282 | 26282 | 52 |
| 34699 | 34699 | 52 |
| 25303 | 25303 | 52 |
| 34396 | 34396 | 52 |
| 34224 | 34224 | 52 |
| 28559 | 28559 | 52 |


# ARIMA

ARIMA stands for Autoregressive Integrated Moving Average. It is a special case of SARIMAX (Seasonal ARIMA with eXogenous regressors), which extends ARIMA with seasonal terms and exogenous regressors.

(sources: [1](https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/), [2](https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3), [3](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases))
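To make the "autoregressive" and "integrated" parts concrete, here is a minimal plain-Python sketch of an ARIMA(1, 1, 0) forecast: difference the series once, fit an AR(1) model to the differences by least squares, then add the predicted differences back up. It leaves out the moving-average and seasonal terms, the function names are hypothetical, and the toy numbers are made up rather than taken from the competition data.

```python
# Illustrative ARIMA(1, 1, 0) sketch in plain Python (no seasonal or
# moving-average terms). All function names here are hypothetical.

def difference(series):
    """First-order differencing: the "Integrated" (d=1) step."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares fit of values[t] = c + phi * values[t-1] (the AR step)."""
    prev, curr = values[:-1], values[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi


def forecast(series, steps):
    """Forecast `steps` values by predicting differences and re-integrating."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    value, diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        diff = c + phi * diff   # AR(1) prediction of the next difference
        value = value + diff    # undo the differencing
        predictions.append(value)
    return predictions


# Toy series (made-up numbers, not competition data).
toy_sales = [100, 110, 116, 120, 123, 125.5, 127.75]
print(forecast(toy_sales, steps=2))
```

For real use, a fitted model from `sm.tsa.statespace.SARIMAX` in statsmodels (already imported above as `sm`) would be the more likely choice, since it also handles the seasonal and moving-average terms.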



# LightGBM

In [80]:
def split_data(train_data, test_data):
    """Add calendar features to both frames and split the training data 80/20."""
    # Parse the date strings so the .dt accessor is available
    # (note: this modifies the DataFrames passed in).
    train_data['date'] = pd.to_datetime(train_data['date'])
    test_data['date'] = pd.to_datetime(test_data['date'])

    # Calendar features; 'day' is the day of the week (Monday=0, Sunday=6).
    for data in (train_data, test_data):
        data['month'] = data['date'].dt.month
        data['day'] = data['date'].dt.dayofweek
        data['year'] = data['date'].dt.year

    # Feature columns: everything in the test set except the raw date and the row id.
    col = [i for i in test_data.columns if i not in ['date', 'id']]
    y = 'sales'
    # Random 80/20 split of the training rows into fit and validation sets.
    train_x, test_x, train_y, test_y = train_test_split(
        train_data[col], train_data[y], test_size=0.2, random_state=2018)
    return (train_x, test_x, train_y, test_y, col)

train_x, test_x, train_y, test_y, col = split_data(train_df, test_df)

In [81]:
train_x.shape,train_y.shape,test_x.shape

((730400, 5), (730400,), (182600, 5))
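The five feature columns are `store`, `item`, `month`, `day` and `year`. As a quick illustration of the three calendar features, the sketch below computes them for a single date using only the standard library. The helper name `calendar_features` is hypothetical. It is meant to mirror pandas' `.dt.month`, `.dt.dayofweek` and `.dt.year`, where the day of the week runs from Monday=0 to Sunday=6.

```python
from datetime import date


def calendar_features(iso_date):
    """Return month, day of week (Monday=0) and year for a 'YYYY-MM-DD' string."""
    d = date.fromisoformat(iso_date)
    # date.weekday() uses the same Monday=0 convention as pandas' dt.dayofweek.
    return {'month': d.month, 'day': d.weekday(), 'year': d.year}


# A date taken from the training sample shown earlier.
print(calendar_features('2017-06-09'))
```

Note that `day` here is the day of the week, not the day of the month.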

In [82]:
%%time

def model(train_x, train_y, test_x, test_y, col):
    """Train a LightGBM regressor with early stopping and predict the test set."""
    params = {
        'nthread': 10,
        'max_depth': 5,
        # 'max_depth': 9,
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'regression_l1',
        'metric': 'mape',  # mean of abs(actual - predicted) / max(1, abs(actual))
        # 'num_leaves': 39,
        'num_leaves': 64,
        'learning_rate': 0.2,
        'feature_fraction': 0.9,
        # 'feature_fraction': 0.8108472661400657,
        # 'bagging_fraction': 0.9837558288375402,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'lambda_l1': 3.097758978478437,
        'lambda_l2': 2.9482537987198496,
        # 'lambda_l1': 0.06,
        # 'lambda_l2': 0.1,
        'verbose': 1,
        'min_child_weight': 6.996211413900573,
        'min_split_gain': 0.037310344962162616,
    }

    lgb_train = lgb.Dataset(train_x, train_y)
    lgb_valid = lgb.Dataset(test_x, test_y)
    # Train for up to 3000 rounds; stop once the validation score has not
    # improved for 50 rounds, and log the metrics every 50 rounds.
    booster = lgb.train(
        params, lgb_train, 3000,
        valid_sets=[lgb_train, lgb_valid],
        callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(50)])
    # Note: test_df is the global competition test set, not the test_x holdout.
    y_test = booster.predict(test_df[col])
    return y_test, booster

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.2 µs
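The `mape` metric set in `params` is described in the code comment as `abs(a-e)/max(1,a)`, averaged over all rows. A minimal plain-Python sketch of that formula follows. The function name is hypothetical, and taking `abs` of the actual value inside `max` is an assumption that makes no difference for non-negative sales.

```python
def mape_with_floor(actual, predicted):
    """Mean of |actual - predicted| / max(1, |actual|) over all rows."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    errors = [abs(a - p) / max(1.0, abs(a)) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Toy values (made up): the floor of 1 keeps the zero-sales row finite.
print(mape_with_floor([50, 0, 20], [45, 2, 25]))
```

The floor of 1 in the denominator is what lets the metric handle days with zero sales, where a plain percentage error would divide by zero.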


In [83]:
%%time
# Note: this rebinds the name `model` from the function to the trained Booster it returns.
y_test, model = model(train_x, train_y, test_x, test_y, col)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002536 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 88
[LightGBM] [Info] Number of data points in the train set: 730400, number of used features: 5
[LightGBM] [Info] Start training from score 47.000000
Training until validation scores don't improve for 50 rounds
[50]	training's mape: 0.185566	valid_1's mape: 0.185849
[100]	training's mape: 0.153625	valid_1's mape: 0.153903
[150]	training's mape: 0.14551	valid_1's mape: 0.145975
[200]	training's mape: 0.140568	valid_1's mape: 0.141232
[250]	training's mape: 0.137945	valid_1's mape: 0.138774
[300]	training's mape: 0.135864	valid_1's mape: 0.136774
[350]	training's mape: 0.134765	valid_1's mape: 0.135823
[400]	training's mape: 0.133891	valid_1's mape: 0.135063
[450]	training's mape: 0.133372	valid_1's mape: 0.134643
[500]	training's mape: 

In [84]:
# Predict sales for the first row of the test set
model.predict(test_df[col].head(1).values)

array([11.75558308])

In [85]:
from joblib import dump
# Save the trained Booster for deployment (the ../model_deploy/ directory must already exist).
dump(model, '../model_deploy/model.joblib')


<lightgbm.basic.Booster at 0x3176da650>