We need to forecast item sales at different stores for the following 3 months.

You can find the LightGBM features [here](https://lightgbm.readthedocs.io/en/latest/Features.html).
We will use a GBM method, LightGBM. There are some important steps before modeling.
LightGBM has no built-in notion of time series,
but we will present the data in a form it can learn from.
Time series have components such as trend and seasonality, and they may or may not be stationary.
We can use whatever ML method we want. However, the features we extract must carry these patterns of our data.


In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer
warnings.filterwarnings('ignore')

# Data Importing

In [None]:
train = pd.read_csv("../input/demand-forecasting-kernels-only/train.csv", parse_dates=["date"])
test = pd.read_csv("../input/demand-forecasting-kernels-only/test.csv", parse_dates=["date"])
sample_sub = pd.read_csv('../input/demand-forecasting-kernels-only/sample_submission.csv')
df = pd.concat([train, test], sort=False)

# EDA

We explore the dataset by looking at its shape, feature types, null values, etc.
We define a basic helper function that you can also reuse in other projects.

In [None]:

def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
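    # numeric_only=True keeps the quantiles to numeric columns (skips the date column).
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)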


check_df(df)

In [None]:
print(train.shape, test.shape, df.shape)

In [None]:
# Date range of the sales records.
df["date"].min(), df["date"].max()

In [None]:
# Summary Stats for each store
df.groupby(["store"]).agg({"sales": ["count", "sum", "mean", "median", "std", "min", "max"]})

## Visualize sales in each store

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(20, 10))
for i in range(1, 11):
    if i < 6:
        train[train.store == i].sales.hist(ax=axes[0, i - 1])
        axes[0, i - 1].set_title("Store " + str(i), fontsize=15)

    else:
        train[train.store == i].sales.hist(ax=axes[1, i - 6])
        axes[1, i - 6].set_title("Store " + str(i), fontsize=15)
plt.tight_layout(pad=4.5)
plt.suptitle("Sales in Stores (Histogram)");

# Feature Engineering
There may be seasonality at different time scales.
We derive date-based features by breaking the date into as many components as possible.
In this section we create the following feature groups:

* Time Related Features (Date Features)
* Shifted Features
* Rolling Mean Features (Moving Average Features)
* Exponentially Weighted Mean Features

## Time Related Features (Date Features)

In [None]:
def create_date_features(df):
    df['month'] = df.date.dt.month
    df['day_of_month'] = df.date.dt.day
    df['day_of_year'] = df.date.dt.dayofyear
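    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement.
    df['week_of_year'] = df.date.dt.isocalendar().week.astype(int)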
    df['day_of_week'] = df.date.dt.dayofweek
    df['year'] = df.date.dt.year
    # weekday is 0 (Monday) to 6 (Sunday), so // 4 flags Friday, Saturday and Sunday as 1.
    df["is_wknd"] = df.date.dt.weekday // 4
    df["quarter"] = df.date.dt.quarter
    df['is_month_start'] = df.date.dt.is_month_start.astype(int)
    df['is_month_end'] = df.date.dt.is_month_end.astype(int)
    df["season"] = np.where(df.month.isin([12,1,2]), 0, 1)  # 0: Winter, 1: Spring
    df["season"] = np.where(df.month.isin([6,7,8]), 2, df["season"]) # 2: Summer
    df["season"] = np.where(df.month.isin([9, 10, 11]), 3, df["season"])  # 3: Fall
    return df

df = create_date_features(df)
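
As a quick sanity check of the `is_wknd` expression, the sketch below applies it to a small hypothetical frame (`toy` is not part of the original data) covering one week that starts on a Friday. It suggests that the flag covers Friday through Sunday, not only Saturday and Sunday.

```python
import pandas as pd

# Hypothetical toy frame: one week starting on Friday, 2021-01-01.
toy = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=7, freq="D")})

# Same expression as in create_date_features: weekday is 0 (Mon) .. 6 (Sun),
# so integer division by 4 yields 1 for Friday, Saturday and Sunday.
toy["day_name"] = toy["date"].dt.day_name()
toy["is_wknd"] = toy["date"].dt.weekday // 4

print(toy)
```

If a strict Saturday/Sunday flag is preferred, `toy["date"].dt.weekday >= 5` would be one alternative.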

## Random noise
We will derive shifted (lag) and rolling mean features from the target.
When we derive them, we add random noise, because a feature that exactly copies a past true value of the target can hurt generalization.
In a way, we deliberately corrupt the data ourselves.
We draw normally distributed noise, one value per row of the data set, and add it to the features we want.
This is a useful trick for time series problems.

In [None]:
def random_noise(dataframe):
    return np.random.normal(scale=1.6, size=(len(dataframe),))
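
The noise above changes on every run. A minimal sketch of a reproducible variant is shown below, assuming a seeded NumPy generator is acceptable. The helper name `random_noise_seeded`, the `seed` argument and the `toy` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def random_noise_seeded(dataframe, scale=1.6, seed=42):
    """Hypothetical variant of random_noise that takes an explicit seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=scale, size=(len(dataframe),))

# Hypothetical toy data standing in for a sales column.
toy = pd.DataFrame({"sales": [10.0, 12.0, 11.0, 13.0, 12.0]})
noisy = toy["sales"] + random_noise_seeded(toy)

# The same seed reproduces the same noise vector.
same = np.array_equal(random_noise_seeded(toy), random_noise_seeded(toy))
print(noisy.round(2).tolist(), same)
```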

## Lag/Shifted Features

Lag features are past real values of the target.
Since the future is influenced most by the most recent periods, we turn those past values into features.
We produce one feature per lag of past sales.
In other words, we create independent variables from the dependent variable.
That is why we defined the random noise function above.

In [None]:
# Lags are taken by row position within each series, so the rows must first be sorted by store, item and date.
df.sort_values(by=['store', 'item', 'date'], axis=0, inplace=True)


def lag_features(dataframe, lags):
    for lag in lags:
        dataframe['sales_lag_' + str(lag)] = dataframe.groupby(["store", "item"])['sales'].transform(
            lambda x: x.shift(lag)) + random_noise(dataframe)
    return dataframe

# The task is to forecast the next 3 months (90 days).
# Since the forecast horizon is 3 months, the lags start beyond it (91 days and up).
# All lags are multiples of 7 days, which helps capture seasonality.
df = lag_features(df, [91, 98, 105, 112, 119, 126, 182, 364, 546, 728])
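
To see what the grouped shift does, here is a small sketch on hypothetical data (`toy` is made up, and the random noise is left out so the values stay readable). It illustrates that a lag computed per `(store, item)` group does not carry values from one series into the next.

```python
import pandas as pd

# Hypothetical toy data: two stores, one item, three days each.
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2, 2],
    "item":  [1, 1, 1, 1, 1, 1],
    "date":  pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "sales": [10.0, 11.0, 12.0, 50.0, 51.0, 52.0],
})
toy = toy.sort_values(["store", "item", "date"])

# Shift inside each (store, item) group so a lag never crosses series.
toy["sales_lag_1"] = toy.groupby(["store", "item"])["sales"].shift(1)
print(toy)
```

The first row of each group has no earlier value, so its lag is `NaN`. LightGBM can handle missing values in features natively.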

## Rolling Mean Features
It is better to shift first and then average: the shift keeps the current day's sales out of its own rolling window.


In [None]:
def roll_mean_features(dataframe, windows):
    for window in windows:
        dataframe['sales_roll_mean_' + str(window)] = dataframe.groupby(["store", "item"])['sales'].transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=10, win_type="triang").mean()
        ) + random_noise(dataframe)
    return dataframe


df = roll_mean_features(df, [365, 546])
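
The effect of shifting before averaging can be seen in a minimal sketch. It uses a hypothetical five-value series and a plain (unweighted) window of 3 instead of the triangular window used above, to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy series.
sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Rolling mean that includes the current value (leaks the target).
leaky = sales.rolling(window=3).mean()

# Shift by one step first, so each window only sees earlier values.
safe = sales.shift(1).rolling(window=3).mean()

print(pd.DataFrame({"sales": sales, "leaky": leaky, "safe": safe}))
```

In the `safe` column, the value at each position is the mean of the three previous observations only.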

## Exponentially Weighted Mean Features
When there is trend and seasonality, it makes more sense to use a weighted moving average than a plain average of the past.
The `alpha` parameter of `ewm` controls how much weight the most recent value gets. If it is high (alpha=0.99), almost all of the weight goes to the nearest period.


In [None]:
def ewm_features(dataframe, alphas, lags):
    for alpha in alphas:
        for lag in lags:
            dataframe['sales_ewm_alpha_' + str(alpha).replace(".", "") + "_lag_" + str(lag)] = \
                dataframe.groupby(["store", "item"])['sales'].transform(lambda x: x.shift(lag).ewm(alpha=alpha).mean())
    return dataframe


alphas = [0.95, 0.9, 0.8, 0.7, 0.5]
lags = [91, 98, 105, 112, 180, 270, 365, 546, 728]
df = ewm_features(df, alphas, lags)
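
A minimal sketch of the role of `alpha`, on a hypothetical series with a jump in the last value. It uses the pandas default `adjust=True`, the same default as the `ewm` call above.

```python
import pandas as pd

# Hypothetical toy series with a jump at the end.
sales = pd.Series([10.0, 20.0, 30.0, 100.0])

high = sales.ewm(alpha=0.99).mean()  # almost all weight on the latest value
low = sales.ewm(alpha=0.5).mean()    # weight spread over older values

print(pd.DataFrame({"sales": sales, "alpha_0.99": high, "alpha_0.5": low}))
```

With `alpha=0.99` the last smoothed value stays close to 100, while with `alpha=0.5` it is pulled down toward the earlier, smaller values.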

## One-Hot Encoding

In [None]:
# df = pd.get_dummies(df, columns=['store', 'item', 'day_of_week', 'month']) 
# not needed for now

## Converting sales to log(1+sales)
We can transform the dependent variable. We undo the transformation when evaluating the error.
We take the logarithm of the dependent variable. Why?
We will use LightGBM, which is based on gradient boosting, and we want to shorten the optimization time.
This is a regression problem, so the transformation applies to the dependent variable.
Taking log(1 + sales) compresses its scale.

In [None]:
df['sales'] = np.log1p(df["sales"].values)
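
The transform is reversible, which is what makes it safe to apply before modeling. A short sketch on hypothetical values:

```python
import numpy as np

# Hypothetical sales values, including zero.
sales = np.array([0.0, 1.0, 9.0, 99.0])

log_sales = np.log1p(sales)     # log(1 + sales); defined for zero sales
restored = np.expm1(log_sales)  # inverse transform: exp(x) - 1

print(log_sales.round(3), restored)
```

`np.log1p` is used instead of `np.log` so that zero sales map to 0 rather than to negative infinity.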

# Modeling

## Custom Cost Function

* MAE: mean absolute error
* MAPE: mean absolute percentage error
* SMAPE: Symmetric mean absolute percentage error (adjusted MAPE)
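
For reference, the three metrics can be written as follows, with $y_i$ the true value, $\hat{y}_i$ the prediction and $n$ the number of observations. The SMAPE form shown is the 0 to 200 variant that the `smape` function below computes; other scalings of SMAPE exist.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|

\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left| y_i \right|}

\mathrm{SMAPE} = \frac{200}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left| \hat{y}_i \right| + \left| y_i \right|}
```

In the SMAPE sum, a term where both $\hat{y}_i$ and $y_i$ are zero is counted as zero error.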

In [None]:
def smape(preds, target):
    n = len(preds)
    masked_arr = ~((preds == 0) & (target == 0))
    preds, target = preds[masked_arr], target[masked_arr]
    num = np.abs(preds - target)
    denom = np.abs(preds) + np.abs(target)
    smape_val = (200 * np.sum(num / denom)) / n
    return smape_val


def lgbm_smape(preds, train_data):
    labels = train_data.get_label()
    smape_val = smape(np.expm1(preds), np.expm1(labels))
    return 'SMAPE', smape_val, False
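
A small worked example of the metric follows. `smape_sketch` is a standalone re-implementation of the `smape` function above, written only so the example runs on its own. The input values are hypothetical.

```python
import numpy as np

def smape_sketch(preds, target):
    """Standalone copy of the notebook's smape logic for demonstration."""
    preds = np.asarray(preds, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(preds)
    # Drop pairs where both values are zero; they count as zero error
    # but still contribute to n.
    keep = ~((preds == 0) & (target == 0))
    num = np.abs(preds[keep] - target[keep])
    denom = np.abs(preds[keep]) + np.abs(target[keep])
    return 200 * np.sum(num / denom) / n

perfect = smape_sketch([100, 200, 0], [100, 200, 0])
off = smape_sketch([110, 180], [100, 200])
print(perfect, round(off, 4))
```

For the second call the two terms are 10/210 and 20/380, so the result is 200 * (10/210 + 20/380) / 2, roughly 10 percent.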

## Time-Based Validation Sets


We have one train set, and we need one validation set. We use it to measure our own error.

In [None]:
# The validation setup should mimic Kaggle's test scenario.
# The test set covers the first 3 months of 2018, so we validate on the first 3 months of 2017.
# Train on data before 2017, validate on the first 3 months of 2017.
train = df.loc[(df["date"] < "2017-01-01"), :]
val = df.loc[(df["date"] >= "2017-01-01") & (df["date"] < "2017-04-01"), ]


cols = [col for col in train.columns if col not in ['date', 'id', "sales", "year"]]

Y_train = train['sales']
X_train = train[cols]

Y_val = val['sales']
X_val = val[cols]

Y_train.shape, X_train.shape, Y_val.shape, X_val.shape
# There are 45000 values in my validation set. My test set also had 45000 values.
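
The same split can be wrapped in a small function. This is only a sketch: `time_split` and the `toy` frame are hypothetical names, and the half-open interval `[val_start, val_end)` mirrors the conditions used above.

```python
import pandas as pd

def time_split(frame, val_start, val_end, date_col="date"):
    """Hypothetical helper: rows before val_start train, [val_start, val_end) validate."""
    train_part = frame.loc[frame[date_col] < val_start]
    val_part = frame.loc[(frame[date_col] >= val_start) & (frame[date_col] < val_end)]
    return train_part, val_part

# Hypothetical toy frame with one row per day.
toy = pd.DataFrame({"date": pd.date_range("2016-12-01", "2017-04-30", freq="D")})
toy["sales"] = range(len(toy))

train_part, val_part = time_split(toy, "2017-01-01", "2017-04-01")
print(len(train_part), len(val_part), val_part["date"].min(), val_part["date"].max())
```

January to March 2017 has 90 days, which with 10 stores and 50 items gives the 45000 validation rows mentioned above.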

# Hyperparameter Tuning

In [None]:
#lgbm_params = {"num_leaves": [10,15,20,31],
#               "learning_rate": [0.1, 0.05, 0.02],
#               "colsample_bytree":[0.5, 0.8, 1.0],
#               "max_depth": [-1, 5, 10, 20]}

In [None]:
# model = lgb.LGBMRegressor()
# tscv = TimeSeriesSplit(n_splits=3)
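# SMAPE is an error (lower is better), so the scorer needs greater_is_better=False;
# without it the search would select the parameters with the highest SMAPE.
# rsearch = GridSearchCV(model, lgbm_params, cv=tscv, scoring=make_scorer(smape, greater_is_better=False), verbose = True, n_jobs = -1).fit( X_train[cols], Y_train )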

In [None]:
# print(rsearch.best_params_)

> {'colsample_bytree': 0.5, 'learning_rate': 0.02, 'max_depth': 5, 'num_leaves': 10}
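
To show how `TimeSeriesSplit` builds its folds, the sketch below splits 12 hypothetical, time-ordered observations. Note that it splits by row position, so it assumes the rows are in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered toy observations
tscv = TimeSeriesSplit(n_splits=3)

folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    # Every training index comes before every validation index.
    print("train:", train_idx, "validate:", test_idx)
```

Each fold trains on an expanding window of earlier rows and validates on the block that follows it.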

In [None]:
lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.5,
              'max_depth': 5,
              'verbose': 0,
              'num_boost_round': 1000,
              'early_stopping_rounds': 200,
              'nthread': -1}

In [None]:
lgbtrain = lgb.Dataset(data=X_train, label=Y_train, feature_name=cols)
lgbval = lgb.Dataset(data=X_val, label=Y_val, reference=lgbtrain, feature_name=cols)


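# LightGBM 4.0 removed the early_stopping_rounds and verbose_eval keyword arguments
# of lgb.train; the equivalent callbacks are used instead.
model = lgb.train(lgb_params, lgbtrain,
                  valid_sets=[lgbtrain, lgbval],
                  num_boost_round=lgb_params['num_boost_round'],
                  feval=lgbm_smape,
                  callbacks=[lgb.early_stopping(lgb_params['early_stopping_rounds']),
                             lgb.log_evaluation(100)])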

> [8700]	training's l1: 0.12471	training's SMAPE: 12.8089	valid_1's l1: 0.131423	valid_1's SMAPE: 13.5114

In [None]:
y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)

In [None]:
# Undo the log transformation, then compute the SMAPE error.
smape(np.expm1(y_pred_val), np.expm1(Y_val))

In [None]:

def plot_lgb_importances(model, plot=False, num=10):

    gain = model.feature_importance('gain')
    feat_imp = pd.DataFrame({'feature': model.feature_name(),
                             'split': model.feature_importance('split'),
                             'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
    if plot:
        plt.figure(figsize=(10, 10))
        sns.set(font_scale=1)
        sns.barplot(x="gain", y="feature", data=feat_imp[0:25])
        plt.title('feature')
        plt.tight_layout()
        plt.show()
    else:
        print(feat_imp.head(num))
        return feat_imp

plot_lgb_importances(model, num=30, plot=True)

In [None]:
feature_imp_df = plot_lgb_importances(model, num=50)
feature_imp_df.gain

In [None]:
cols = feature_imp_df[feature_imp_df.gain > 0.015].feature.tolist()
print("Independent Variables:", len(cols))

# Final Model

In [None]:
train_final = df.loc[~df.sales.isna()] 
Y_train_final = train_final['sales']
X_train_final = train_final[cols]

test_final = df.loc[df.sales.isna()]
X_test_final = test_final[cols]

In [None]:

lgb_params = {'metric': {'mae'},
              'num_leaves': 10,
              'learning_rate': 0.02,
              'feature_fraction': 0.5,
              'max_depth': 5,
              'verbose': 0,
              'nthread': -1,
              "num_boost_round": model.best_iteration}

In [None]:
lgbtrain_all = lgb.Dataset(data=X_train_final, label=Y_train_final, feature_name=cols)

In [None]:
model = lgb.train(lgb_params, lgbtrain_all, num_boost_round=model.best_iteration)

In [None]:
test_preds = model.predict(X_test_final, num_iteration=model.best_iteration)

## Define Forecast

In [None]:
forecast = pd.DataFrame({
    "date":test_final.date,
    "store":test_final.store,
    "item":test_final.item,
    "sales":np.expm1(test_preds)
})

## Submission

In [None]:
submission_df = test_final.loc[:, ['id', 'sales']]
submission_df['sales'] = np.expm1(test_preds)
submission_df['id'] = submission_df.id.astype(int)
submission_df.to_csv('submission_demand3.csv', index=False)
submission_df.head()

In [None]:
submission_df[["sales"]].describe([0.1, 0.75, 0.8, 0.9, 0.95, 0.99]).T

## Visualize Sales Forecast

In [None]:
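# train_final is a slice of df, so build a new frame instead of assigning into the slice.
train_final = train_final.assign(sales=np.expm1(train_final["sales"]))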

In [None]:
train_final[(train_final.store == 1) & (train_final.item == 1)].set_index("date").sales.plot(figsize = (20,9),legend=True, label = "Store 1 Item 1 Sales")
forecast[(forecast.store == 1) & (forecast.item == 1)].set_index("date").sales.plot(legend=True, label = "Store 1 Item 1 Forecast", color ="orange");