# <center>Introduction to Data Mining</center>
# <center> <font color='blue'>Term Project</font></center>

## <font color='blue'>Topic:</font> Store Sales - Time Series Forecasting (use machine learning to predict grocery sales)
<hr/>

### Supervisor: <font color='blue'>Prof. 김재환</font>
### Student: <font color='blue'>TRAN DUY THANH</font>
### ID: <font color='blue'>20207144</font>

<hr/>


# Goal of the Term Project:

This term project is based on an international competition. Competitors use time-series forecasting to predict store sales on data from Corporación Favorita, a large Ecuadorian grocery retailer.
Specifically, competitors build a model that accurately predicts the unit sales for thousands of items sold at different Favorita stores. The training dataset is approachable and contains dates, store and item information, promotions, and unit sales.

# Dataset Description:
The datasets of this competition are described below.
## The train.csv file
•	The training data comprises time series of the features store_nbr, family, and onpromotion, as well as the target sales.

•	store_nbr identifies the store at which the products are sold.

•	family identifies the type of product sold.

•	sales gives the total sales for a product family at a particular store on a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).

•	onpromotion gives the total number of items in a product family that were being promoted at a store on a given date.


## The test.csv file

•	The test data, having the same features as the training data. You will predict the target sales for the dates in this file.

•	The dates in the test data are for the 15 days after the last date in the training data.

## The holidays_events.csv file

•	Holidays and events, with metadata.

•	Additional holidays are days added to a regular calendar holiday, as typically happens around Christmas (for example, making Christmas Eve a holiday).



This code block imports the libraries and loads the dataset.

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())

# Setup notebook
from pathlib import Path
# plot style settings
from learntools.time_series.style import *  

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

dataset="../input/store-sales-time-series-forecasting/train.csv"

store_sales=pd.read_csv(dataset,
                       parse_dates=["date"],
                        infer_datetime_format=True
                       )
store_sales = store_sales.set_index('date').to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family'], append=True)
average_sales = store_sales.groupby('date').mean()['sales'].loc['2016':'2017']

In [None]:
store_sales

View the data of average_sales:

In [None]:
average_sales 

# 1) Determine trend with a moving average plot

Plot the average sales of Corporación Favorita.

In [None]:
ax = average_sales.plot(**plot_params)
ax.set(title="Corporación Favorita Sales", 
       ylabel="Number of Sales");

We continue with the time series of average sales. The next cell draws a moving-average plot of average_sales to estimate the trend.

In [None]:
trend = average_sales.rolling(
    window=365,
    center=True,
    min_periods=183,
).mean()

ax = average_sales.plot(**plot_params, alpha=0.5)
ax = trend.plot(ax=ax, linewidth=3)
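
As a rough illustration of how `window`, `center`, and `min_periods` interact, the sketch below applies the same kind of rolling mean to a tiny synthetic series. The values are made up and are not competition data.

```python
import pandas as pd

# Synthetic series 0, 1, ..., 9 (an illustrative stand-in for average_sales)
s = pd.Series(range(10), dtype="float64")

# Centered 5-point window that accepts windows with at least 3 observations
ma = s.rolling(window=5, center=True, min_periods=3).mean()

# Same window without min_periods: incomplete windows at the edges give NaN
strict = s.rolling(window=5, center=True).mean()

print(ma.tolist())
print(strict.tolist())
```

With `window=365` and `min_periods=183`, about half a window is enough, which is why the trend line can reach the ends of the series instead of stopping half a year short.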

# 2) Create a Trend Feature

Use DeterministicProcess to create a feature set for a cubic trend model. Also create features for a 90-day forecast.

In [None]:
from statsmodels.tsa.deterministic import DeterministicProcess

#set the target
y = average_sales.copy()  

# Instantiate `DeterministicProcess` with arguments appropriate for a cubic trend model
dp = DeterministicProcess(index=y.index, order=3)

# Create the feature set for the dates given in y.index
X = dp.in_sample()

# Create features for a 90-day forecast.
X_fore = dp.out_of_sample(steps=90)

# Fit a linear model on the trend features and plot the result
model = LinearRegression()
model.fit(X, y)

y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, linewidth=3, label="Trend", color='C0')
ax = y_fore.plot(ax=ax, linewidth=3, label="Trend Forecast", color='C3')
ax.legend();
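
To make the idea of cubic trend features concrete, here is a NumPy-only sketch. It is not the statsmodels implementation: the helper `polynomial_trend_features` is a hypothetical name, the time index is assumed to start at 1, and the series is synthetic.

```python
import numpy as np

def polynomial_trend_features(n_in_sample, steps, order=3):
    """Return (X, X_fore): powers t, t^2, ..., t^order of a 1-based time index.

    Hypothetical helper for illustration only.
    """
    t = np.arange(1, n_in_sample + steps + 1, dtype=float)
    X_all = np.column_stack([t ** k for k in range(1, order + 1)])
    return X_all[:n_in_sample], X_all[n_in_sample:]

# Synthetic series that follows an exact cubic trend
t = np.arange(1, 51, dtype=float)
y = 2.0 + 0.5 * t - 0.01 * t**2 + 0.0002 * t**3

X, X_fore = polynomial_trend_features(n_in_sample=50, steps=10)

# Least-squares fit with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Forecast: the time index simply continues past the training range
A_fore = np.column_stack([np.ones(len(X_fore)), X_fore])
y_fore = A_fore @ coef
```

The out-of-sample features are just the same powers evaluated at later time steps, which is what lets a linear model extrapolate a trend.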

# 3) Seasonality - Create indicators and Fourier features to capture periodic change


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from learntools.time_series.utils import plot_periodogram, seasonal_plot
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

dataset="../input/store-sales-time-series-forecasting/train.csv"

store_sales = pd.read_csv(
    dataset,
    usecols=['store_nbr', 'family', 'date', 'sales'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()
# Average over all stores and families, then keep only 2017
average_sales = (
    store_sales
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

Show the average_sales data:

In [None]:
average_sales

## We use seasonal_plot:

In [None]:
X = average_sales.to_frame()
X["week"] = X.index.week
X["day"] = X.index.dayofweek
seasonal_plot(X, y='sales', period='week', freq='day');

## We use plot_periodogram

In [None]:
plot_periodogram(average_sales);
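
A periodogram shows how strongly each frequency is present in a series. The sketch below computes one with NumPy's FFT on a synthetic signal that has a known weekly cycle. It is only an illustration of the idea and makes no claim about how `plot_periodogram` is implemented.

```python
import numpy as np

# Synthetic daily series: 52 full weeks with a pure 7-day cycle
n = 364
t = np.arange(n)
signal = 10.0 + 3.0 * np.sin(2 * np.pi * t / 7)

# Power spectrum of the mean-removed signal
spectrum = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)  # frequencies in cycles per day

# The strongest frequency corresponds to the weekly season
peak = np.argmax(spectrum)
dominant_period = 1.0 / freqs[peak]
print(f"dominant period: {dominant_period:.2f} days")
```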

# Create seasonal features
Use DeterministicProcess and CalendarFourier to create:

•	indicators for weekly seasons, and

•	Fourier features of order 4 for monthly seasons.

In [None]:
y = average_sales.copy()

# Create CalendarFourier
fourier = CalendarFourier(freq='M', order=4)
# Create DeterministicProcess
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

# Fit a linear regression on the seasonal features
model = LinearRegression().fit(X, y)
# In-sample predictions (fitted values)
y_pred = pd.Series(
    model.predict(X),
    index=X.index,
    name='Fitted',
)
ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax.legend();
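
Fourier features of order 4 are four sine/cosine pairs whose frequencies are multiples of the base season. The sketch below builds them by hand, assuming a fixed 30-day period. `CalendarFourier` follows the actual calendar instead, so this is a simplification, and `fourier_features` is a hypothetical helper name.

```python
import numpy as np

def fourier_features(t, period, order):
    """Columns sin(2*pi*k*t/period), cos(2*pi*k*t/period) for k = 1..order."""
    cols = []
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

# Two assumed 30-day "months" of daily time steps
t = np.arange(60, dtype=float)
X_fourier = fourier_features(t, period=30.0, order=4)

print(X_fourier.shape)  # 2 columns per order
```

Eight columns can describe a smooth monthly pattern that would otherwise need one indicator per day of the month.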

# 4) Time Series as Features

In [None]:
# plot style settings
from learntools.time_series.style import *  
from learntools.time_series.utils import plot_lags, make_lags, make_leads

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

dataset = '../input/store-sales-time-series-forecasting/train.csv'

store_sales = pd.read_csv(
    dataset,
    usecols=['store_nbr', 'family', 'date', 'sales', 'onpromotion'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
        'onpromotion': 'uint32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

family_sales = (
    store_sales
    .groupby(['family', 'date'])
    .mean() 
    .unstack('family')
    .loc['2017', ['sales', 'onpromotion']]
)

Show the data of family_sales 

In [None]:
family_sales

Not every product family has sales showing cyclic behavior, and neither does the series of average sales. Sales of school and office supplies, however, show patterns of growth and decay not well characterized by trend or seasons. 

Trend and seasonality both create serial dependence that shows up in correlograms and lag plots. To isolate any purely cyclic behavior, we start by deseasonalizing the series. The next cell deseasonalizes the supply sales and stores the result in the variable y_deseason.


In [None]:
supply_sales = family_sales.loc(axis=1)[:, 'SCHOOL AND OFFICE SUPPLIES']
y = supply_sales.loc[:, 'sales'].squeeze()

fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    constant=True, index=y.index,
    order=1,seasonal=True,
    drop=True,
    additional_terms=[fourier],
)
X_time = dp.in_sample()
X_time['NewYearsDay'] = (X_time.index.dayofyear == 1)

model = LinearRegression(fit_intercept=False)
model.fit(X_time, y)
y_deseason = y - model.predict(X_time)
y_deseason.name = 'sales_deseasoned'

ax = y_deseason.plot()
ax.set_title("Sales of School and Office Supplies (deseasonalized)");

## Plotting cycles

Create a seven-day moving average from y, the series of supply sales. Use a centered window, but don't set the min_periods argument.

In [None]:
#create rolling
y_ma = y.rolling(7, center=True).mean()

# Plot
ax = y_ma.plot()
ax.set_title("Seven-Day Moving Average");

Next, we examine the deseasonalized series for serial dependence using the partial autocorrelation correlogram and the lag plot.


In [None]:
plot_pacf(y_deseason, lags=8);
plot_lags(y_deseason, lags=8, nrows=2);

In [None]:
onpromotion = supply_sales.loc[:, 'onpromotion'].squeeze().rename('onpromotion')

# Keep only days where the average number of promoted items exceeds 1
plot_lags(x=onpromotion.loc[onpromotion > 1], y=y_deseason.loc[onpromotion > 1], lags=3, leads=3, nrows=1);
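
Lag and lead plots compare a series with shifted copies of itself or of another series. The sketch below shows how such shifted columns can be built with `Series.shift`. The helper `lag_lead_frame` is a hypothetical name and is not the `make_lags` or `make_leads` function imported above.

```python
import pandas as pd

def lag_lead_frame(series, lags, leads):
    """Build lagged (past) and leading (future) copies of a series."""
    cols = {f"lag_{i}": series.shift(i) for i in range(1, lags + 1)}
    cols.update({f"lead_{i}": series.shift(-i) for i in range(1, leads + 1)})
    return pd.concat(cols, axis=1)

# Tiny synthetic series (made-up values)
s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="y")
features = lag_lead_frame(s, lags=2, leads=1)

# Rows near the edges contain NaN because no earlier or later value exists
complete = features.dropna()
print(features)
```

Leads only make sense for features that are known in advance, such as planned promotions. A lead of the target itself would leak future information.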

# 5) Hybrid Models

Linear regression excels at extrapolating trends but cannot learn interactions. XGBoost excels at learning interactions but cannot extrapolate trends. We therefore look at how to build "hybrid" forecasters that combine complementary learning algorithms so that the strengths of one make up for the weaknesses of the other, and then apply a hybrid model to the grocery sales forecast.

### Components and Residuals

A time series can be described as a sum of components:


series = trend + seasons + cycles + error


Each term in this model is called a component of the time series. The residuals of a model are the differences between the target the model was trained on and the predictions the model makes.
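
The two-stage idea can be sketched on synthetic data: fit a trend, compute the residuals, then fit a second model on those residuals and add the two predictions. Here the second model is a simple per-weekday mean used as a stand-in, not XGBoost, and all numbers are made up.

```python
import numpy as np

# Synthetic series = linear trend + weekly season (no noise)
t = np.arange(70, dtype=float)
weekday = (t % 7).astype(int)
season = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -2.0, -2.0])[weekday]
series = 5.0 + 0.3 * t + season

# Model 1: linear trend fitted by least squares
slope, intercept = np.polyfit(t, series, deg=1)
trend_fit = intercept + slope * t
residuals = series - trend_fit

# Model 2 (stand-in): mean residual for each weekday
weekday_effect = np.array([residuals[weekday == d].mean() for d in range(7)])
hybrid_fit = trend_fit + weekday_effect[weekday]

rmse_trend = np.sqrt(np.mean((series - trend_fit) ** 2))
rmse_hybrid = np.sqrt(np.mean((series - hybrid_fit) ** 2))
print(f"trend only: {rmse_trend:.3f}, hybrid: {rmse_hybrid:.3f}")
```

The second model never sees the raw target, only what the first model failed to explain.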



In [None]:
# plot style settings
from learntools.time_series.style import *  

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from statsmodels.tsa.deterministic import DeterministicProcess
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor


dataset='../input/store-sales-time-series-forecasting/train.csv'

store_sales = pd.read_csv(
    dataset,
    usecols=['store_nbr', 'family', 'date', 'sales', 'onpromotion'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

family_sales = (
    store_sales
    .groupby(['family', 'date'])
    .mean()
    .unstack('family')
    .loc['2017']
)

Show the family_sales data:

In [None]:
family_sales


Now we will create a boosted hybrid for the *Store Sales* dataset by implementing a new Python class. The hybrid model class has `fit` and `predict` methods to give it a scikit-learn-like interface.


In [None]:
class BoostedHybrid:
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None
    def fit(self, X_1, X_2, y):
        self.model_1.fit(X_1, y)

        y_fit = pd.DataFrame(self.model_1.predict(X_1), 
            index=X_1.index, columns=y.columns)
        #compute residuals
        y_resid = y - y_fit
        y_resid = y_resid.stack().squeeze()

        #fit self.model_2 on residuals
        self.model_2.fit(X_2, y_resid)

        # Save column names for predict method
        self.y_columns = y.columns
        # Save fitted values and residuals for later inspection
        self.y_fit = y_fit
        self.y_resid = y_resid        
    def predict(self, X_1, X_2):
        y_pred = pd.DataFrame(
            #predict with self.model_1
            self.model_1.predict(X_1), 
            index=X_1.index, columns=self.y_columns)
        y_pred = y_pred.stack().squeeze()# wide to long

        #add self.model_2 predictions to y_pred
        y_pred += self.model_2.predict(X_2)
        return y_pred.unstack()  # long to wide
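
The `fit` and `predict` methods rely on `stack` and `unstack` to move between the wide layout (one column per family) and the long layout (one row per date and family). A small sketch with made-up values shows the round trip and the row order that the long format produces.

```python
import pandas as pd

# Wide format: one row per date, one column per family (synthetic values)
dates = pd.period_range("2017-01-01", periods=3, freq="D", name="date")
wide = pd.DataFrame(
    {"BREAD": [1.0, 2.0, 3.0], "DAIRY": [4.0, 5.0, 6.0]},
    index=dates,
)
wide.columns.name = "family"

# Wide to long: a Series indexed by (date, family), ordered date by date
long = wide.stack()

# Long to wide: back to one column per family
back = long.unstack()

print(long)
```

Because the long Series is ordered date by date, the features passed as `X_2` need the same row order for the residuals and the second model's predictions to line up.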

Now we can use the BoostedHybrid class to create a model for the Store Sales data. The next cell sets up the data for training.

In [None]:
# Target series
y = family_sales.loc[:, 'sales']

# X_1: Features for Linear Regression
dp = DeterministicProcess(index=y.index, order=1)
X_1 = dp.in_sample()

# X_2: Features for XGBoost
X_2 = family_sales.drop('sales', axis=1).stack()  # onpromotion feature

# Label encoding for 'family'
le = LabelEncoder()  # from sklearn.preprocessing
X_2 = X_2.reset_index('family')
X_2['family'] = le.fit_transform(X_2['family'])

# Seasonality feature: day of the month
X_2["day"] = X_2.index.day  # values are day of the month

## Train boosted hybrid
Create the hybrid model by initializing a BoostedHybrid class with LinearRegression() and XGBRegressor() instances.

In [None]:
# Create a BoostedHybrid from LinearRegression and XGBRegressor instances
model = BoostedHybrid(
    model_1=LinearRegression(),
    model_2=XGBRegressor(),
)

# Fit on the full data and predict in-sample
model.fit(X_1, X_2, y)

y_pred = model.predict(X_1, X_2)

y_pred = y_pred.clip(0.0)

Split the data into training and validation sets, then fit on the training set and predict on both:

In [None]:
#create train and valid set
y_train, y_valid = y[:"2017-07-01"], y["2017-07-02":]
X1_train, X1_valid = X_1[: "2017-07-01"], X_1["2017-07-02" :]
X2_train, X2_valid = X_2.loc[:"2017-07-01"], X_2.loc["2017-07-02":]
#call fit method
model.fit(X1_train, X2_train, y_train)
#call predict method
y_fit = model.predict(X1_train, X2_train).clip(0.0)
y_pred = model.predict(X1_valid, X2_valid).clip(0.0)
#plot the first 6 product families in the dataset
families = y.columns[0:6]
axs = y.loc(axis=1)[families].plot(
    subplots=True, sharex=True, figsize=(11, 9), **plot_params, alpha=0.5,
)
#filter and plot the y_fit
y_fit.loc(axis=1)[families].plot(subplots=True, sharex=True, color='C0', ax=axs)
#filter and plot the y_pred
y_pred.loc(axis=1)[families].plot(subplots=True, sharex=True, color='C3', ax=axs)
for ax, family in zip(axs, families):
    ax.legend([])
    ax.set_ylabel(family)

In [None]:
families

# 6) DataExecutor and XGBoost on GPU to evaluate the metric

I use XGBoost, building on and improving the code from https://www.kaggle.com/code/koheishima/store-sales-simple-xg-boost-gpu-lb-0-44579

I use a GPU accelerator to run the algorithm.

# Declare packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import time
import calendar
import xgboost as xgb

from datetime import date, datetime
from learntools.time_series.style import *  
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

# DataExecutor class

The DataExecutor class loads the datasets and preprocesses the train set, the test set, and the auxiliary datasets.

In [None]:
class DataExecutor:
    path='../input/store-sales-time-series-forecasting/'
    #load all dataset in store sales time series dataset
    def load_dataset(self):
        self.train = pd.read_csv(self.path + 'train.csv')
        self.test = pd.read_csv(self.path + 'test.csv')
        self.oil = pd.read_csv(self.path + 'oil.csv')
        self.holiday = pd.read_csv(self.path + 'holidays_events.csv')
        self.store = pd.read_csv(self.path + 'stores.csv')
        self.tran = pd.read_csv(self.path + 'transactions.csv')
        self.submission = pd.read_csv(self.path + 'sample_submission.csv')
    #add weekday, year, month, day and payday
    #this function is used for clean_train_test method
    def preprocess_train_test(self,df):
        df['date'] = df['date'].map(lambda x: date.fromisoformat(x))
        df['weekday'] = df['date'].map(lambda x: x.weekday())
        df['year'] = df['date'].map(lambda x: x.year)
        df['month'] = df['date'].map(lambda x: x.month)
        df['day'] = df['date'].map(lambda x: x.day)
        df['eomd'] = df['date'].map(lambda x: calendar.monthrange(x.year, x.month)[1])
        df['payday'] = ((df['day'] == df['eomd'])|(df['day'] == 15)).astype(int)
        df.drop(['id', 'eomd'], axis=1, inplace=True)
        return df
    #clean train test set
    def clean_train_test(self):
        self.train = self.preprocess_train_test(self.train)
        self.test = self.preprocess_train_test(self.test)
    #clean oil dataset: fill missing prices with the monthly average
    def clean_oil_dataset(self):
        self.oil['month'] = self.oil['date'].map(lambda x: int(x.replace('-', '')[:6]))
        self.oil['month_avg'] = self.oil.groupby('month')['dcoilwtico'].transform('mean')
        self.oil['dcoilwtico'] = self.oil['dcoilwtico'].fillna(self.oil['month_avg'])
        self.oil = self.oil.drop(['month', 'month_avg'], axis=1)
        self.oil['date'] = self.oil['date'].map(lambda x: date.fromisoformat(x))
    #process event holiday dataset
    def process_event_holiday(self):
        self.holiday['date'] = self.holiday['date'].map(lambda x: date.fromisoformat(x))
        #keep holidays that were not transferred and are not make-up work days
        self.holiday = self.holiday[(self.holiday['transferred']==False)&(self.holiday['type']!='Work Day')]
        self.holiday = self.holiday[['date', 'description']].rename(columns={'description': 'event_name'})
    #merge the oil, store, holiday and transaction data into df
    #this function is used by the merge_train_test_set method
    def merge_dataset(self, df):
        df = df.merge(self.oil, on='date', how='left')
        df = df.merge(self.store, on='store_nbr', how='left')
        df = df.merge(self.holiday, on='date', how='left').fillna('0')
        df = df.merge(self.tran, on=['date', 'store_nbr'], how='left').fillna(0)
        return df
    #this function merge train test set
    def merge_train_test_set(self):
        self.train = self.merge_dataset(self.train)
        self.test = self.merge_dataset(self.test)
        self.train['dcoilwtico'] = self.train['dcoilwtico'].astype(float)
        self.test['dcoilwtico'] = self.test['dcoilwtico'].astype(float)
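
The `payday` flag in `preprocess_train_test` marks the 15th and the last day of each month. The standalone sketch below shows the same rule using only the standard library. The function name `is_payday` is hypothetical.

```python
import calendar
from datetime import date

def is_payday(d):
    """True on the 15th and on the last day of the month."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == 15 or d.day == last_day

# Leap years are handled by calendar.monthrange
print(is_payday(date(2017, 2, 28)))  # last day of February 2017
print(is_payday(date(2016, 2, 28)))  # not the last day in a leap year
print(is_payday(date(2017, 8, 15)))  # mid-month payday
```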

Create a DataExecutor instance and call its methods:

In [None]:
#create executor instance object
executor=DataExecutor()
#call load dataset method
executor.load_dataset()
#call clean train test method
executor.clean_train_test()
#call clean oil data set
executor.clean_oil_dataset()
#call process event holiday data set
executor.process_event_holiday()
#call merge train test set method
executor.merge_train_test_set()
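
The oil cleaning step replaces missing prices with the average price of the same month. The sketch below shows that imputation on a small made-up frame. The column name `dcoilwtico` matches the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up oil prices with gaps
oil = pd.DataFrame({
    "date": ["2017-01-02", "2017-01-03", "2017-01-04", "2017-02-01", "2017-02-02"],
    "dcoilwtico": [50.0, np.nan, 52.0, np.nan, 54.0],
})

# Month key such as "2017-01", then the mean price within each month
month = oil["date"].str[:7]
month_avg = oil.groupby(month)["dcoilwtico"].transform("mean")

# Fill only the missing prices; existing values are left unchanged
oil["dcoilwtico"] = oil["dcoilwtico"].fillna(month_avg)
print(oil)
```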

# SimpleXGBoost class


The SimpleXGBoost class creates the features and target, runs the model, and exports the submission file.

In [None]:
class SimpleXGBoost:
    features = ['family', 'store_nbr', 'city', 'state', 'type', 'cluster','event_name']
    params_for_xgb = {
                    'tree_method': 'gpu_hist', 
                    'gpu_id': 0,
                    'predictor': 'gpu_predictor', 
                    'verbosity': 2,
                    'objective': 'reg:squarederror', 
                    'eval_metric': 'rmse', 
                    'random_state': 42,
                    'learning_rate': 0.01,
                    'subsample': 1.0,
                    'colsample_bytree': 0.7,
                    'reg_alpha': 10.0,
                    'reg_lambda': 0.2,
                    'min_child_weight': 50,
                }
    def __init__(self,executor):
        self.executor=executor
    #label-encode the categorical features of the train and test sets
    def transform(self):       
        for col in self.features:
            le = LabelEncoder()
            self.executor.train[col] = le.fit_transform(self.executor.train[col])
            self.executor.test[col] = le.transform(self.executor.test[col])
    #create the train and valid period sets from the unique dates
    #n1 is the index of the first training date
    #n2 is the index where training ends and validation begins
    def creat_train_valid_period_set(self,n1,n2):
        self.train_date = self.executor.train['date'].unique()[n1:n2].tolist()
        self.valid_date = self.executor.train['date'].unique()[n2:].tolist()
        self.executor.train['is_train'] = self.executor.train['date'].map(lambda x: x in self.train_date)
        self.executor.train['is_valid'] = self.executor.train['date'].map(lambda x: x in self.valid_date)
    #print the train and valid periods
    def print_train_valid_period_set(self):
        print('train date from {} to {}'.format(min(self.train_date), max(self.train_date)))
        print('valid date from {} to {}'.format(min(self.valid_date), max(self.valid_date)))
    #setup feature target
    def setup_feature_target(self):
        self.y = np.log(self.executor.train['sales'] + 1)
        self.X_train = self.executor.train.drop(['date', 'sales', 'year'], axis=1)
        self.X_test = self.executor.test.drop(['date', 'year'], axis=1)
    #run xgboost    
    def run_xgboost(self):
        start = time.time()    
        # extract train and valid dataset
        train_idx = self.X_train[self.X_train['is_train']==True].index.tolist()
        val_idx = self.X_train[self.X_train['is_valid']==True].index.tolist()

        X_tr = self.X_train.loc[train_idx, :].drop(['is_train', 'is_valid'], axis=1)
        self.X_val = self.X_train.loc[val_idx, :].drop(['is_train', 'is_valid'], axis=1)
        y_tr = self.y[train_idx]
        self.y_val = self.y[val_idx]

        xgb_train = xgb.DMatrix(X_tr, label=y_tr)
        xgb_valid = xgb.DMatrix(self.X_val, label=self.y_val)
        evallist = [(xgb_train, 'train'), (xgb_valid, 'eval')]
        self.evals_result = dict()

        self.model = xgb.train(params=self.params_for_xgb, dtrain=xgb_train, 
                          evals=evallist, evals_result=self.evals_result,
                          verbose_eval=5000, num_boost_round=100000, early_stopping_rounds=100)

        #best_iteration is zero-based and iteration_range is half-open,
        #so add 1 to include the best boosting round
        best_range = (0, self.model.best_iteration + 1)
        self.xgb_oof = self.model.predict(xgb_valid, iteration_range=best_range)

        xgb_test = xgb.DMatrix(self.X_test)
        self.xgb_pred = pd.Series(self.model.predict(xgb_test, iteration_range=best_range),
                             name='xgb_pred')

        elapsed = time.time() - start
        #take the square root explicitly instead of relying on the squared=False argument
        error_value = np.sqrt(mean_squared_error(self.y_val, self.xgb_oof))
        print(f"xgboost rmse: {error_value:.5f}, elapsed time: {elapsed:.2f}sec\n")

        return self.xgb_oof, self.model, self.evals_result, self.xgb_pred, self.y_val, self.X_val
    #plot validation predictions against actual values for one store and family
    def plot_data(self):
        df_error = self.X_val[['store_nbr', 'family']].copy()
        df_error.reset_index(drop=True, inplace=True)
        df_error['oof'] = pd.Series(self.xgb_oof)
        df_error['y_valid'] = self.y_val.reset_index(drop=True).copy()
        #store_nbr and family hold label-encoded values at this point
        mask = (df_error['store_nbr']==1)&(df_error['family']==12)
        y_oof = df_error[mask]['oof'].tolist()
        #use a local variable so that self.y_val is not overwritten
        y_true = df_error[mask]['y_valid'].tolist()
        sns.lineplot(x=range(len(y_oof)), y=y_oof)
        sns.lineplot(x=range(len(y_oof)), y=y_true)
    #export submission file to upload to the competition
    def export_submission(self):
        self.executor.submission['sales'] = np.exp(self.xgb_pred.map(lambda x: max(x, 0))) - 1
        self.executor.submission.to_csv('submission.csv', index=False)
        print("export submission.csv is successful!")
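
`setup_feature_target` trains on log(sales + 1) and `export_submission` inverts that transform. One consequence is that the RMSE measured in log space equals the root mean squared logarithmic error on the original scale. The sketch below checks this with made-up numbers. The helper names `rmse` and `rmsle` are hypothetical.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return rmse([math.log1p(v) for v in y_true], [math.log1p(v) for v in y_pred])

# Made-up sales and predictions
sales = [0.0, 3.0, 10.5, 120.0]
preds = [0.5, 2.0, 12.0, 100.0]

# Transform to log space, as the target is transformed before training
log_sales = [math.log1p(v) for v in sales]
log_preds = [math.log1p(v) for v in preds]

# expm1 undoes log1p, which is what the submission step relies on
restored = [math.expm1(v) for v in log_sales]

print(rmse(log_sales, log_preds), rmsle(sales, preds))
```

`math.log1p` and `math.expm1` compute log(1 + x) and exp(x) - 1 and stay accurate for values near zero, which are common in sales data.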


Create a SimpleXGBoost instance and call its methods:

In [None]:
#create SimpleXGBoost instance object
sxgb=SimpleXGBoost(executor)
#call transform method
sxgb.transform()
#call create train valid period set
sxgb.creat_train_valid_period_set(-300,-10)
#call print train valid period set
sxgb.print_train_valid_period_set()
#call setup feature target method
sxgb.setup_feature_target()
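
The call with `-300` and `-10` uses dates 300 to 11 from the end for training and the last 10 dates for validation. The sketch below shows the same kind of time-based split on a small synthetic frame, using `isin` as a vectorized way to flag rows. The frame and the chosen offsets are illustrative only.

```python
import pandas as pd

# Synthetic frame: 20 days with 2 rows per day, already in date order
dates = pd.date_range("2017-01-01", periods=20, freq="D").repeat(2)
df = pd.DataFrame({"date": dates})

# Slice the unique dates: 10 training days followed by 5 validation days
unique_dates = df["date"].unique()
train_dates = unique_dates[-15:-5]
valid_dates = unique_dates[-5:]

# Flag the rows that belong to each period
df["is_train"] = df["date"].isin(train_dates)
df["is_valid"] = df["date"].isin(valid_dates)

print(df.loc[df["is_train"], "date"].min(), df.loc[df["is_train"], "date"].max())
print(df.loc[df["is_valid"], "date"].min(), df.loc[df["is_valid"], "date"].max())
```

Validating on the most recent dates mirrors the forecasting task, where the model must predict days that come after everything it was trained on.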

In [None]:
#call run xgboost method

sxgb.run_xgboost()

In [None]:
#call plot data
sxgb.plot_data()

In [None]:
#call export submission file
sxgb.export_submission()