## Demand Forecasting Using Big Data

## Content
1. [Introduction](#section-intro)
2. [Importing libraries and Kaggle setup](#section-ts)
3. [Load Dataset](#section-pro)
4. [Basic Exploratory Data Analysis](#section-ten)
5. [Feature Engineering](#section-ten)
6. [Data Encoding](#section-ten)
7. [LightGBM Model](#section-ten)
8. [Time Series Analysis](#section-ten)   

## 1. Introduction

#### Dataset Overview
* A store chain's 5-year data includes information on 10 different stores and 50 different products.
* The data set covers the period between 01-01-2013 and 31-12-2017.

#### Business Problem
* The goal is to build a 3-month demand forecasting model covering the chain's 10 stores and 50 products.
* Afterwards, the data set is reduced to weekly frequency and a demand forecasting model is built for 2017.

#### Variables
* date - Date of the sales record (no holiday effects or store closures)
* store - Store ID, a unique number for each store
* item - Item ID, a unique number for each item
* sales - Number of items sold at a particular store on a given date

## 2. Importing Libraries and Kaggle Setup

In [None]:
# Importing Libraries

import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns
import lightgbm as lgb
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX
# The old statsmodels.tsa.arima_model module was removed in statsmodels 0.13
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm
import itertools

import warnings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
warnings.filterwarnings('ignore')

#Kaggle setup
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



## 3. Load Dataset

In [None]:
train = pd.read_csv('../input/demand-forecasting-kernels-only/train.csv', parse_dates=['date'])
test = pd.read_csv('../input/demand-forecasting-kernels-only/test.csv', parse_dates=['date'])
df = pd.concat([train, test], sort=False)
df.head()

In [None]:
print("Size of train set:",train.shape)
print("Size of test set:",test.shape)

In [None]:
#Removing the extra column 'id'
df.drop(['id'],inplace=True,axis=1)
df.columns

## 4. Basic Exploratory Data Analysis

In [None]:
#DATE RANGE

print("Date range:", df["date"].min(), "to", df["date"].max())
#1st Jan 2013 to 31st March, 2018

In [None]:
# SALES DISTRIBUTION

df["sales"].describe([0.10, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99])

In [None]:
# NUMBER OF STORES

df["store"].nunique()

In [None]:
# NUMBER OF PRODUCTS

df["item"].nunique() 

In [None]:
# NUMBER OF PRODUCTS IN EACH STORE
df.groupby(["store"])["item"].nunique()
#Every store sells all the 50 products

In [None]:
# Sales statistics in store-product breakdown
df.groupby(["store", "item"]).agg({"sales": ["sum", "mean", "median", "std"]})

## 5. Feature Engineering

In [None]:
# Generating date and time parameters from given date

df['month'] = df.date.dt.month
df['day_of_month'] = df.date.dt.day
df['day_of_year'] = df.date.dt.dayofyear 
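# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
df['week_of_year'] = df.date.dt.isocalendar().week.astype(int)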
df['day_of_week'] = df.date.dt.dayofweek
df['year'] = df.date.dt.year
df["is_wknd"] = df.date.dt.weekday // 4  # 1 for Friday-Sunday (weekday >= 4), 0 otherwise
df['is_month_start'] = df.date.dt.is_month_start.astype(int)
df['is_month_end'] = df.date.dt.is_month_end.astype(int) 

In [None]:
df.head()

In [None]:
# Sales statistics in store-item-month breakdown
df.groupby(["store", "item", "month"]).agg({"sales": ["sum", "mean", "median", "std"]})

In [None]:
# DEALING WITH RANDOM NOISE
# For small datasets like this one, random noise can be added to the generated features to reduce overfitting.
# Here Gaussian noise with mean 0 and standard deviation 1.6 is used.

def random_noise(dataframe):
    return np.random.normal(scale=1.6, size=(len(dataframe),))
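
As a rough illustration of what `random_noise` contributes, the sketch below draws noise with the same `scale=1.6` for a hypothetical constant feature column and checks that it is centered on zero. The toy frame, the seed, and the column names `lag_demo` and `lag_demo_noisy` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed chosen only to make the illustration repeatable

def random_noise(dataframe):
    # Same definition as above: zero-mean Gaussian noise, standard deviation 1.6
    return np.random.normal(scale=1.6, size=(len(dataframe),))

# Hypothetical toy frame: 10,000 rows holding a constant "lag" value
toy = pd.DataFrame({"lag_demo": np.full(10_000, 50.0)})
noise = random_noise(toy)
toy["lag_demo_noisy"] = toy["lag_demo"] + noise

print("noise mean:", round(noise.mean(), 3))
print("noise std :", round(noise.std(), 3))
```

The feature keeps its level on average, while individual rows are perturbed by about 1.6 units, which makes it harder for the model to memorize exact lagged values.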

In [None]:
#Lag/Shifted Features (Delays)
df.sort_values(by=['store', 'item', 'date'], axis=0, inplace=True)
df.head(10)

In [None]:
def lag_features(dataframe, lags):
    for lag in lags:
        dataframe['sales_lag_' + str(lag)] = dataframe.groupby(["store", "item"])['sales'].transform(
            lambda x: x.shift(lag)) + random_noise(dataframe)
    return dataframe

df = lag_features(df, [91, 98, 105, 112, 119, 126, 182, 364, 546, 728])
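
The smallest lag is 91 days, presumably because the forecast horizon is 3 months and shorter lags would not be available for the test period. The sketch below shows the grouped shift itself on hypothetical toy data (two stores, one item, four days, no noise added), so that the per-series behaviour is easy to see.

```python
import pandas as pd

# Hypothetical toy data: 2 stores x 1 item x 4 days
toy = pd.DataFrame({
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "item": [1] * 8,
    "date": list(pd.date_range("2013-01-01", periods=4)) * 2,
    "sales": [10, 11, 12, 13, 20, 21, 22, 23],
})
toy = toy.sort_values(["store", "item", "date"])

# Shift within each (store, item) series so values never cross series boundaries
toy["sales_lag_2"] = toy.groupby(["store", "item"])["sales"].shift(2)
print(toy)
```

The first two rows of each store are `NaN`: store 2 does not inherit the last sales of store 1, which is why the data is grouped before shifting.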

In [None]:
#Moving Average Features
def moving_average_features(dataframe, windows):
    for window in windows:
        dataframe['sales_roll_mean_' + str(window)] = dataframe.groupby(["store", "item"])['sales']. \
                                                          transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=10, win_type="triang").mean()) + random_noise(
            dataframe)
    return dataframe


df = moving_average_features(df, [365, 546, 730])
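
The `shift(1)` before `rolling` keeps the current day's sales out of its own window. The sketch below contrasts the two variants on a hypothetical five-value series; it uses a plain (unweighted) window of size 2 rather than the triangular window above, purely to keep the arithmetic easy to follow.

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Without shift, the window ending at row t includes row t itself
leaky = sales.rolling(window=2, min_periods=1).mean()
# With shift(1), the window at row t only covers rows before t
safe = sales.shift(1).rolling(window=2, min_periods=1).mean()

print(pd.DataFrame({"sales": sales, "leaky": leaky, "safe": safe}))
```

In the `leaky` column every value already contains the target of the same row, which would leak the answer into the feature.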


In [None]:
#Exponentially Weighted Average Features
def ewm_features(dataframe, alphas, lags):
    for alpha in alphas:
        for lag in lags:
            dataframe['sales_ewm_alpha_' + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe.groupby(["store", "item"])['sales'].transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe


alphas = [0.99, 0.95, 0.9, 0.8, 0.7, 0.5]
lags = [91, 98, 105, 112, 180, 270, 365, 546, 728]

df = ewm_features(df, alphas, lags)
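
To see what the different `alpha` values do, the following sketch applies two of them to a hypothetical oscillating series (no lag, no grouping). It is only meant to show the smoothing behaviour, not to reproduce the features above.

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 10.0, 20.0, 10.0])

# alpha close to 1 follows the latest observation; a smaller alpha smooths more
ewm_fast = sales.ewm(alpha=0.99).mean()
ewm_slow = sales.ewm(alpha=0.5).mean()

print(pd.DataFrame({"sales": sales, "alpha_0.99": ewm_fast, "alpha_0.5": ewm_slow}))
```

With `alpha=0.99` the smoothed value stays very close to the most recent observation, while `alpha=0.5` settles between the two levels.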

## 6. Data Encoding

In [None]:
#Checking for null values
df.info()

In [None]:
#One-Hot Encoding
df = pd.get_dummies(df, columns=['day_of_week', 'month'])

In [None]:
#Converting sales to log(1+sales)
df['sales'] = np.log1p(df["sales"].values)
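
Because the model is trained on `log1p(sales)`, predictions have to be mapped back with `np.expm1` before they are compared with real sales. A minimal round-trip check on hypothetical values:

```python
import numpy as np

sales = np.array([0, 1, 9, 99])
log_sales = np.log1p(sales)       # log(1 + x); defined at x = 0
recovered = np.expm1(log_sales)   # exp(y) - 1 undoes the transform

print(log_sales)
print(recovered)
```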

## 7. LightGBM Model

In [None]:
# Train data set until the beginning of 2017 (end of 2016)
train = df.loc[(df["date"] < "2017-01-01"), :]

# First 3 months of 2017 validation set
val = df.loc[(df["date"] >= "2017-01-01") & (df["date"] < "2017-04-01"), :]

# Independent variables
cols = [col for col in train.columns if col not in ['date', 'id', "sales", "year"]]

In [None]:
# Dependent variable for the train set
Y_train = train['sales']

# Independent variables for the train set
X_train = train[cols]

# Dependent variable for the validation set
Y_val = val['sales']

# Independent variables for the validation set
X_val = val[cols]

# Checking the shapes
Y_train.shape, X_train.shape, Y_val.shape, X_val.shape

In [None]:
# Custom evaluation metric: SMAPE (symmetric mean absolute percentage error)

def smape(preds, target):
    n = len(preds)
    masked_arr = ~((preds == 0) & (target == 0))
    preds, target = preds[masked_arr], target[masked_arr]
    num = np.abs(preds - target)
    denom = np.abs(preds) + np.abs(target)
    smape_val = (200 * np.sum(num / denom)) / n
    return smape_val


def lgbm_smape(preds, train_data):
    labels = train_data.get_label()
    smape_val = smape(np.expm1(preds), np.expm1(labels))
    return 'SMAPE', smape_val, False
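
A small worked example may help when reading SMAPE values. The sketch below repeats the same logic as `smape` on three hypothetical prediction/target pairs; note that a pair where both values are 0 is dropped from the sum but still counted in `n`.

```python
import numpy as np

def smape(preds, target):
    # Same logic as the function above: pairs where both values are 0 are
    # dropped from the sum, but n still counts every observation
    n = len(preds)
    mask = ~((preds == 0) & (target == 0))
    preds, target = preds[mask], target[mask]
    return 200 * np.sum(np.abs(preds - target) / (np.abs(preds) + np.abs(target))) / n

preds = np.array([100.0, 50.0, 0.0])
target = np.array([110.0, 50.0, 0.0])

# Only the first pair contributes: 200 * (10 / 210) / 3
value = smape(preds, target)
print(round(value, 3))
```

The result is a percentage between 0 and 200, where 0 means every prediction matches its target.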


In [None]:
# LightGBM parameters
lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.8,
              'max_depth': 5,
              'verbose': 0,
              'num_boost_round': 10000, 
              'early_stopping_rounds': 200,
              'nthread': -1}

In [None]:
lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)
lgbval = lgb.Dataset(data=X_val, label=Y_val, reference=lgbtrain, feature_name=cols)

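model = lgb.train(lgb_params, lgbtrain,
                  valid_sets=[lgbtrain, lgbval],
                  num_boost_round=lgb_params['num_boost_round'],
                  feval=lgbm_smape,
                  # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval
                  # arguments of lgb.train; callbacks provide the same behaviour
                  callbacks=[lgb.early_stopping(stopping_rounds=lgb_params['early_stopping_rounds']),
                             lgb.log_evaluation(period=100)])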

y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)

# percentage of validation error
smape(np.expm1(y_pred_val), np.expm1(Y_val))

### Final Model

In [None]:
# determination of test and train dependent/independent variables

train = df.loc[~df.sales.isna()]
Y_train = train['sales']
X_train = train[cols]

test = df.loc[df.sales.isna()]
X_test = test[cols]

In [None]:
lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.8,
              'max_depth': 5,
              'verbose': 0,
              'nthread': -1,
              "num_boost_round": model.best_iteration}

# LightGBM dataset
lgbtrain_all = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)

model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration)
test_preds = model.predict(X_test, num_iteration=model.best_iteration)

## 8. Time Series Analysis

* In this section, the train data set is first reduced to a weekly basis.
* Then, using the weekly data set, sales demand forecasting models for 2017 are created with, in order:
    * LightGBM
    * Single Exponential Smoothing
    * Double Exponential Smoothing
    * Triple Exponential Smoothing
    * ARIMA
    * SARIMA
* Actual values are compared with estimated values.
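
The weekly reduction relies on pandas `resample("W")`. As a minimal sketch on hypothetical toy data (14 consecutive days with made-up sales), the following shows how daily rows are averaged into weekly bins:

```python
import pandas as pd

# Hypothetical toy data: 14 consecutive days starting on a Monday
toy = pd.DataFrame({
    "date": pd.date_range("2013-01-07", periods=14, freq="D"),
    "sales": list(range(1, 15)),
}).set_index("date")

# "W" groups by calendar week ending on Sunday and labels each bin with that Sunday
weekly = toy.resample("W").mean()
print(weekly)
```

Each weekly row is labelled with the Sunday that closes the week, so date-derived features such as the day of week are constant after resampling.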


### Load Dataset and EDA

In [None]:
data = pd.read_csv('../input/demand-forecasting-kernels-only/train.csv', parse_dates=['date'])
data.head()
data.shape

# reduce dataset to weekly
data.set_index("date",inplace=True)
df = data.resample("W").mean()
df.reset_index(inplace=True)
df.head()
df.shape

df.index.freq = "W"
df.head()

In [None]:
df.shape

### Feature Engineering

In [None]:
# Month
df['month'] = df.date.dt.month
# Day of Month
df['day_of_month'] = df.date.dt.day
# Day of year
df['day_of_year'] = df.date.dt.dayofyear
# Week of year
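# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
df['week_of_year'] = df.date.dt.isocalendar().week.astype(int)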
# Day of week
df['day_of_week'] = df.date.dt.dayofweek
# Year
df['year'] = df.date.dt.year
# Weekend check
df["is_wknd"] = df.date.dt.weekday // 4  # 1 for Friday-Sunday (weekday >= 4), 0 otherwise
# Month start check
df['is_month_start'] = df.date.dt.is_month_start.astype(int)
# Month end check
df['is_month_end'] = df.date.dt.is_month_end.astype(int)

# Lag/Shifted Features (Delays); on the weekly data each lag counts weeks, not days
def lag_features(dataframe, lags):
    for lag in lags:
        dataframe['sales_lag_' + str(lag)] = dataframe['sales'].transform(
            lambda x: x.shift(lag)) + random_noise(dataframe)
    return dataframe

df = lag_features(df, [31, 61, 91, 98, 105, 112])


# Moving Average Features
def roll_mean_features(dataframe, windows):
    for window in windows:
        dataframe['sales_roll_mean_' + str(window)] = dataframe['sales']. \
                                                          transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=10, win_type="triang").mean()) + random_noise(
            dataframe)
    return dataframe


df = roll_mean_features(df, [31, 61, 91, 98, 105, 112])


# Exponentially Weighted Mean Features
def ewm_features(dataframe, alphas, lags):
    for alpha in alphas:
        for lag in lags:
            dataframe['sales_ewm_alpha_' + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe['sales'].transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe


alphas = [0.99, 0.95, 0.9, 0.8, 0.7, 0.5]
lags = [10, 20, 30, 40, 50]

df = ewm_features(df, alphas, lags)

df.tail()


### LightGBM Model

In [None]:
# One-Hot Encoding
df = pd.get_dummies(df, columns=['day_of_week', 'month'])

# Converting sales to log(1+sales)
df['sales'] = np.log1p(df["sales"].values)

# train-test data selection
train = df.loc[(df["date"] < "2017-01-01"), :]
test = df.loc[(df["date"] >= "2017-01-01"), :]

# Dependent and Independent variables
cols = [col for col in train.columns if col not in ['date', "sales", "year"]]
X_train = train[cols]
Y_train = train['sales']
X_test = test[cols]
Y_test = test["sales"]


In [None]:
# LightGBM parameters
lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.8,
              'max_depth': 5,
              'verbose': 0,
              'num_boost_round': 10000, 
              'early_stopping_rounds': 200, 
              'nthread': -1}

lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)
lgbval = lgb.Dataset(data=X_test, label=Y_test, reference=lgbtrain, feature_name=cols)

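model = lgb.train(lgb_params, lgbtrain,
                  valid_sets=[lgbtrain, lgbval],
                  num_boost_round=lgb_params['num_boost_round'],
                  feval=lgbm_smape,
                  # LightGBM 4.0 removed the early_stopping_rounds and verbose_eval
                  # arguments of lgb.train; callbacks provide the same behaviour
                  callbacks=[lgb.early_stopping(stopping_rounds=lgb_params['early_stopping_rounds']),
                             lgb.log_evaluation(period=100)])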

y_pred_test = model.predict(X_test, num_iteration=model.best_iteration)

# percentage of test error
smape(np.expm1(y_pred_test), np.expm1(Y_test))


### Feature Importance

In [None]:
def plot_lgb_importances(model, plot=False, num=10):

    gain = model.feature_importance('gain')
    feat_imp = pd.DataFrame({'feature': model.feature_name(),
                             'split': model.feature_importance('split'),
                             'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    if plot:
        plt.figure(figsize=(10, 10))
        sns.set(font_scale=1)
        sns.barplot(x="gain", y="feature", data=feat_imp[0:25])
        plt.title('feature')
        plt.tight_layout()
        plt.show()
    else:
        print(feat_imp.head(num))


plot_lgb_importances(model, num=30)
#plot_lgb_importances(model, num=30, plot=True)

lgb.plot_importance(model, max_num_features=20, figsize=(10, 10), importance_type="gain")
plt.show()


In [None]:
# Final Model

lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.8,
              'max_depth': 5,
              'verbose': 0,
              'nthread': -1,
              "num_boost_round": model.best_iteration}

# LightGBM dataset
lgbtrain_all = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)

model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration)
test_preds = model.predict(X_test, num_iteration=model.best_iteration)


## Here is the prediction!

In [None]:
# Actual values and 2017 predictions (both on the log1p scale, since sales was transformed above)
forecast = pd.DataFrame({"date":test["date"],
                        "store":test["store"],
                        "item":test["item"],
                        "sales":test_preds
                        })

df.set_index("date").sales.plot(figsize = (20,9),legend=True, label = "Actual")
forecast.set_index("date").sales.plot(legend=True, label = "Predict")
plt.show()