**I couldn't find a good, noob-friendly notebook, so I decided to write one myself using what I learned from other notebooks.**

References:  
https://www.kaggle.com/hiro5299834/store-sales-ridge-voting-bagging-et-bagging-rf  
https://www.kaggle.com/andrej0marinchenko/hyperparamaters#DeterministicProcess  
https://www.kaggle.com/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide  

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

PATH = '../input/store-sales-time-series-forecasting/'

# Data Analysis (Understanding the dataset and the task)

The first thing I do is read the dataset [notes](https://www.kaggle.com/c/store-sales-time-series-forecasting/data) provided by the author.  
We have 7 (actually 6) csv files to work with, and each of them should be analysed individually.  
Our target: predict sales for the thousands of product families sold at Favorita stores.  
There are 54 stores, and my idea is to train a model on each store's data separately.

Now let's take a closer look at each table.

## Train and Test datasets

In [None]:
train = pd.read_csv(PATH + 'train.csv', dtype={'store_nbr': 'category'}, 
                    usecols=['store_nbr', 'family', 'date', 'sales', 'onpromotion'])
test = pd.read_csv(PATH + 'test.csv', dtype={'store_nbr': 'category'},
                   usecols=['store_nbr', 'family', 'date', 'onpromotion'])

# Check for missing values
print('Missing values in train:', train.isna().sum().sum())
print('Missing values in test:', test.isna().sum().sum())

# There are some missing dates in training set ['2013-12-25', '2014-12-25', '2015-12-25', '2016-12-25'] 

train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])

In [None]:
train.head(20)

There are a lot of zeros. This can mean that the store was not open yet, or that we simply do not have the information.  
We should plot the train dataset and look at it closely.  

In [None]:
temp = train.set_index('date').groupby('store_nbr').resample('D').sales.sum().reset_index()
px.line(temp, x='date', y='sales', color='store_nbr',
        title='Daily total sales of the stores')

Some of the stores don't have sales information until 2014, 2015, or 2017.  
Let's fix it by dropping the rows before each of these stores starts reporting sales.  

In [None]:
print(train.shape)
train = train[~((train.store_nbr == '52') & (train.date < "2017-04-20"))]
train = train[~((train.store_nbr == '22') & (train.date < "2015-10-09"))]
train = train[~((train.store_nbr == '42') & (train.date < "2015-08-21"))]
train = train[~((train.store_nbr == '21') & (train.date < "2015-07-24"))]
train = train[~((train.store_nbr == '29') & (train.date < "2015-03-20"))]
train = train[~((train.store_nbr == '20') & (train.date < "2015-02-13"))]
train = train[~((train.store_nbr == '53') & (train.date < "2014-05-29"))]
train = train[~((train.store_nbr == '36') & (train.date < "2013-05-09"))]
print(train.shape)
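
The cutoff dates above were read off the plot by hand. As a rough alternative, the first day with non-zero sales could be computed per store. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not competition data), and on the real data the resulting dates would not necessarily match the hand-picked ones.

```python
import pandas as pd

# Toy data (hypothetical): store '1' sells from day one, store '2' opens later
toy = pd.DataFrame({
    'store_nbr': ['1', '1', '1', '2', '2', '2'],
    'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03'] * 2),
    'sales': [5.0, 0.0, 7.0, 0.0, 0.0, 3.0],
})

# First date with non-zero sales for each store
first_sale = toy[toy['sales'] > 0].groupby('store_nbr')['date'].min()

# Keep only rows on or after each store's first sale date
toy_trimmed = toy[toy['date'] >= toy['store_nbr'].map(first_sale)]
print(first_sale)
print(toy_trimmed)
```

Note that a zero-sales day after the first sale (store '1' on the second day) is kept, since it may be a real observation.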

In our train and test datasets we have one interesting feature, **onpromotion**: the total number of items in a product family that were being promoted at a given date.  
It sounds like it should have a noticeable effect on sales.

In [None]:
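# Select the numeric columns explicitly: recent pandas versions raise on non-numeric columns
train[['sales', 'onpromotion']].corr('spearman').loc['sales', 'onpromotion']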

Not a bad result, but how can we use it? I have one idea.

## Transactions

In [None]:
transactions = pd.read_csv(PATH + 'transactions.csv', dtype={'store_nbr': 'category'})
transactions.head()

In [None]:
print('Missing values in transactions:', transactions.isna().sum().sum())

# Similar to training set, we have missing dates ['2013-12-25', '2014-12-25', '2015-12-25', 
#                                                 '2016-01-01', '2016-01-03', '2016-12-25']

transactions['date'] = pd.to_datetime(transactions['date'])

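# Show that transactions are highly correlated with sales
temp = pd.merge(train.groupby(['date', 'store_nbr']).sales.sum().reset_index(),
                transactions, how='left')
# Select the numeric columns explicitly: recent pandas versions raise on non-numeric columns
print(temp[['sales', 'transactions']].corr("spearman").loc["sales", "transactions"])

# Now we can check whether stores are busier on days off than on working days
# (mean number of transactions per day of week, one line per year)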
temp = transactions.copy()
temp['year'] = temp.date.dt.year
temp['day_of_week'] = temp.date.dt.dayofweek + 1
temp = temp.groupby(['year', 'day_of_week']).transactions.mean().reset_index()

px.line(temp, x='day_of_week', y='transactions', color='year', title='Transactions')

After the visual analysis, it is clear that we should set holidays correctly to get better results.  
That is all for transactions; we do not need them for training the models.  

## Holidays Events

In [None]:
def strip_spaces(a_str_with_spaces):
    return a_str_with_spaces.replace(' ', '')

holidays = pd.read_csv(PATH + 'holidays_events.csv', index_col='date',
                       parse_dates=['date'], infer_datetime_format=True,
                       converters={'locale_name': strip_spaces})  # removes spaces from locale_name

holidays.head()

In [None]:
# By printing unique labels, we can check the data for misspellings and understand it better
print('Holidays types:', holidays['type'].unique())
print('Holidays region types:', holidays['locale'].unique()) 
print('Holidays locale names:', holidays['locale_name'].unique())  

In [None]:
# What about missing values
holidays.isna().sum()  

We already know about **type** and **transferred** columns from dataset description. And now we understand what **locale** and **locale_name** are.

**Now**: we need to create a full calendar and mark working and non-working days.

In [None]:
# Calendar
holidays_rdy = pd.DataFrame(index=pd.date_range('2013-01-01', '2017-08-31'))
holidays_rdy['day_of_week'] = holidays_rdy.index.dayofweek + 1  # Monday = 1, Sunday = 7
holidays_rdy['work_day'] = True
holidays_rdy.loc[holidays_rdy['day_of_week'] > 5, 'work_day'] = False  # False for saturdays and sundays 

# Fixing index duplicates in holidays dataset
duplicates = holidays[holidays.index.duplicated(keep=False)]
print(duplicates['locale_name'])

# This was done manually
duplicates = [('2012-06-25', 'Latacunga Machala'), ('2012-07-03', 'ElCarmen'),
              ('2012-12-22', 'Ecuador'), ('2012-12-24', 'Ecuador'),
              ('2012-12-31', 'Ecuador'), ('2013-05-12', 'Ecuador'),
              ('2013-06-25', 'Machala Latacunga'), ('2013-07-03', 'SantoDomingo'),
              ('2013-12-22', 'Salinas'), ('2014-06-25', 'Machala Imbabura Ecuador'),
              ('2014-07-03', 'SantoDomingo'), ('2014-12-22', 'Ecuador'),
              ('2014-12-26', 'Ecuador'), ('2015-06-25', 'Imbabura Latacunga'),
              ('2015-07-03', 'SantoDomingo'), ('2015-12-22', 'Salinas'),
              ('2016-04-21', 'Ecuador'), ('2016-05-01', 'Ecuador'),
              ('2016-05-07', 'Ecuador'), ('2016-05-08', 'Ecuador'),
              ('2016-05-12', 'Ecuador'), ('2016-06-25', 'Imbabura Latacunga'),
              ('2016-07-03', 'SantoDomingo'), ('2016-07-24', 'Guayaquil'),
              ('2016-11-12', 'Ecuador'), ('2016-12-22', 'Salinas'),
              ('2017-04-14', 'Ecuador'), ('2017-06-25', 'Latacunga Machala'),
              ('2017-07-03', 'SantoDomingo'), ('2017-12-08', 'Quito'),
              ('2017-12-22', 'Ecuador')]
# No holidays were transferred in duplicates

holidays = holidays.groupby(holidays.index).first()  # keeps only the first row per date; the other locale names are appended below
for date, locale_name in duplicates:
    holidays.loc[date, 'locale_name'] = holidays.loc[date, 'locale_name'] + ' ' + locale_name

In [None]:
# Apply holidays to calendar
holidays_rdy = holidays_rdy.merge(holidays, how='left', left_index=True, right_index=True)

# type column: 'Work Day'
holidays_rdy.loc[holidays_rdy['type'] == 'Work Day', 'work_day'] = True

# type column: 'Holiday', 'Transfer', 'Additional', 'Bridge'
holidays_rdy.loc[(holidays_rdy['type'] == 'Holiday') &
                 (holidays_rdy['locale_name'].str.contains('Ecuador', na=False)),
                 'work_day'] = False
holidays_rdy.loc[(holidays_rdy['type'] == 'Transfer') & 
                 (holidays_rdy['locale_name'].str.contains('Ecuador', na=False)),  
                 'work_day'] = False
holidays_rdy.loc[(holidays_rdy['type'] == 'Additional') & 
                 (holidays_rdy['locale_name'].str.contains('Ecuador', na=False)),
                 'work_day'] = False
holidays_rdy.loc[(holidays_rdy['type'] == 'Bridge') & 
                 (holidays_rdy['locale_name'].str.contains('Ecuador', na=False)),   
                 'work_day'] = False

holidays_rdy.drop(['locale'], axis=1, inplace=True)

# transferred column
holidays_rdy.loc[holidays_rdy['transferred'] == True, 'work_day'] = True
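
The four national day-off assignments above differ only in the `type` value, so they could be collapsed with `isin`. A minimal sketch on a made-up calendar (`cal` and its rows are assumptions):

```python
import pandas as pd

# Toy calendar (hypothetical rows) with the same column names as holidays_rdy
cal = pd.DataFrame({
    'type': ['Holiday', 'Bridge', 'Holiday', 'Work Day', None],
    'locale_name': ['Ecuador', 'Ecuador', 'Quito', 'Ecuador', None],
    'work_day': [True, True, True, False, True],
})

day_off_types = ['Holiday', 'Transfer', 'Additional', 'Bridge']
is_national = cal['locale_name'].str.contains('Ecuador', na=False)

# National days off in one assignment; the local holiday (Quito) is left untouched
cal.loc[cal['type'].isin(day_off_types) & is_national, 'work_day'] = False
# Compensation work days are working days even if they fall on a weekend
cal.loc[cal['type'] == 'Work Day', 'work_day'] = True
print(cal)
```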

**type column: 'Event'**  
There are multiple events in the dataset: **Mother's Day**, the **Football** championship, **Black Friday**, **Cyber Monday**, and the Manabi **Earthquake** (about a month long).  
We should understand how they affect our sales.

In [None]:
# First let's look at events
events = holidays_rdy[holidays_rdy['type']=='Event']
events

All events are national, no events were transferred.  
We should set one label for all football events, same for earthquake.  

In [None]:
# I do it for simplicity
holidays_rdy.loc[holidays_rdy['description'].str.contains('Terremoto', na=False),
                 'description'] = 'Earthquake'
holidays_rdy.loc[holidays_rdy['description'].str.contains('futbol', na=False), 
                 'description'] = 'Football'
events = holidays_rdy[holidays_rdy['type']=='Event']

# Check for misspellings
print(events['description'].unique())

# Print mean sales 
sales = train.groupby(['date']).sales.sum()
events = events.merge(sales, how='left', left_index=True, right_index=True)
print(events.groupby(['description']).sales.mean())
print('All sales mean:', sales.mean())

This is an imprecise method because we do not have enough data, but **Earthquake** and **Cyber Monday** should definitely be considered during training, as well as **Black Friday**.  
Sales do not depend much on **Football** and **Mother's Day**.

In [None]:
# descriptions 
descriptions = pd.get_dummies(holidays_rdy['description'])[['Earthquake', 'Cyber Monday', 'Black Friday']]
holidays_rdy = holidays_rdy.merge(descriptions, how='left', left_index=True, right_index=True)

# Fill NaNs
holidays_rdy['locale_name'] = holidays_rdy['locale_name'].fillna('Ecuador')

# Get rid of useless columns
holidays_rdy.drop(['type', 'description', 'transferred'], axis=1, inplace=True)

In [None]:
# To merge two dataframes on their indexes, the indexes must have the same type; we will need this later
holidays_rdy['date'] = holidays_rdy.index
holidays_rdy['date'] = pd.to_datetime(holidays_rdy['date'])
holidays_rdy['date'] = holidays_rdy['date'].dt.to_period('D')
holidays_rdy = holidays_rdy.set_index(['date'])

holidays_rdy = pd.get_dummies(holidays_rdy, columns=['day_of_week'])

holidays_rdy.head()

## Oil

In [None]:
oil = pd.read_csv(PATH + 'oil.csv')
oil['date'] = pd.to_datetime(oil['date'])
oil.head()

Without any digging, we can see missing prices and dates, which are better to fill in.  

In [None]:
# Resample
oil = oil.set_index('date')['dcoilwtico'].resample(
    'D').sum().reset_index()  # add missing dates and fill NaNs with 0 

# Interpolate
oil['dcoilwtico'] = np.where(oil['dcoilwtico']==0, np.nan, oil['dcoilwtico'])  # replace 0 with NaN
oil['dcoilwtico_interpolated'] = oil.dcoilwtico.interpolate()  # fill NaN values using an interpolation method

oil.head(10)
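
To see what the interpolation step does, here is a small sketch on made-up prices (the values are assumptions). It uses `asfreq('D')` to add the missing calendar days as NaN directly, which is a swapped-in shortcut for the resample-sum-then-replace-zeros approach above.

```python
import numpy as np
import pandas as pd

# Toy price series (hypothetical values): first value missing, then a gap in dates
prices = pd.Series(
    [np.nan, 10.0, 12.0],
    index=pd.to_datetime(['2013-01-01', '2013-01-02', '2013-01-06']),
)

# asfreq adds the missing calendar days as NaN
daily = prices.asfreq('D')

# Default linear interpolation fills gaps between known values only
filled = daily.interpolate()
print(filled)
```

The leading NaN survives because there is no earlier value to interpolate from, so the first row still has to be filled separately.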

In [None]:
temp = oil.melt(id_vars=['date'], var_name='Legend') 
px.line(temp.sort_values(['Legend', 'date'], ascending=[False, True]), x='date',
        y='value', color='Legend', title='Daily Oil Price')

In [None]:
oil_rdy = oil.loc[:, ['date', 'dcoilwtico_interpolated']]
oil_rdy.iloc[0, 1] = 93.1

assert oil_rdy.isna().sum().sum() == 0

oil_rdy['date'] = pd.to_datetime(oil_rdy['date'])
oil_rdy['date'] = oil_rdy['date'].dt.to_period('D')
oil_rdy = oil_rdy.set_index(['date'])
oil_rdy

But what if oil prices don't influence sales? Why do we need an oil dataset?  
Here is the answer.  
For some product families we can see a strong correlation.

In [None]:
import matplotlib.pyplot as plt

def plot_sales_and_oil_dependency():
    a = pd.merge(train.groupby(["date", "family"]).sales.sum().reset_index(),
                 oil.drop("dcoilwtico", axis=1), how="left")
    c = a.groupby("family").corr("spearman").reset_index()
    c = c[c.level_1 == "dcoilwtico_interpolated"][["family", "sales"]].sort_values("sales")
    
    fig, axes = plt.subplots(7, 5, figsize=(20, 20))
    for i, fam in enumerate(c.family):
        a[a.family == fam].plot.scatter(x="dcoilwtico_interpolated", y="sales", ax=axes[i // 5, i % 5])
        axes[i // 5, i % 5].set_title(fam + "\n Correlation:" + str(c[c.family == fam].sales.iloc[0])[:6],
                                 fontsize=12)
        axes[i // 5, i % 5].axvline(x=70, color='r', linestyle='--')

    plt.tight_layout(pad=5)
    plt.suptitle("Daily Oil Product & Total Family Sales \n", fontsize=20)
    plt.show()

plot_sales_and_oil_dependency()

Add **rolling mean** and **lags**  

In [None]:
oil_rdy['rolling_mean_7'] = oil_rdy['dcoilwtico_interpolated'].rolling(7).mean()
oil_rdy.fillna(93.1, inplace=True)
oil_rdy

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf

_ = plot_pacf(oil_rdy.rolling_mean_7, lags=12, method='ywm')  # 1 lag

In [None]:
for i in range(1, 2) :
    oil_rdy[f'oil_lag_{i}'] = oil_rdy.rolling_mean_7.shift(i)
oil_rdy.fillna(93.1, inplace=True)
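
A tiny sketch of what the rolling mean and the lag do, and why both steps leave NaNs at the start that need filling. The series and the window of 3 are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

rolling_3 = s.rolling(3).mean()   # first 2 values are NaN: the window is not full yet
lag_1 = rolling_3.shift(1)        # previous day's rolling mean; adds one more NaN
print(pd.DataFrame({'value': s, 'rolling_3': rolling_3, 'lag_1': lag_1}))
```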

## Stores

In [None]:
stores = pd.read_csv(PATH + 'stores.csv', index_col='store_nbr',
                     converters={'city': strip_spaces, 'state': strip_spaces})  # removes spaces

stores.head()

In [None]:
# Let's look at the unique labels
print('Cities:\n', stores['city'].unique())  
print('States:\n', stores['state'].unique())  
print('Store types:\n', stores['type'].unique())  # no type information was provided in data description
print('Clusters:\n', sorted(list(stores['cluster'].unique())))


# Do not forget about missing values
print('Missing values:', stores.isna().sum().sum())

We should connect stores and holidays by location, because holidays can be local or regional (in one city, or in the whole state).

In [None]:
stores_rdy = stores.loc[:, ['city', 'state']]
stores_rdy.head()
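
One way to connect a store to a holiday row is to check whether its city or state is among the space-separated names in `locale_name`. The helper below is a hypothetical sketch (`is_local_holiday` is not part of the dataset or of any library). It compares whole names rather than substrings, and it assumes that spaces inside names were already stripped, as done when loading the data.

```python
def is_local_holiday(locale_name: str, city: str, state: str) -> bool:
    """Return True if the store's city or state is one of the listed locales."""
    locales = set(locale_name.split())
    return city in locales or state in locales

# A store in Latacunga (Cotopaxi) is affected by a Latacunga holiday
print(is_local_holiday('Imbabura Latacunga', 'Latacunga', 'Cotopaxi'))
# A national entry is not treated as local here
print(is_local_holiday('Ecuador', 'Quito', 'Pichincha'))
```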

# Preprocessing

I want to split the training and testing datasets so that I can train models for each store separately, because of the local holidays.  
Yeah, maybe it's not the best idea, but let's try.  

## y

In [None]:
train['date'] = train['date'].dt.to_period('D')
train_rdy = train.set_index(['store_nbr', 'family', 'date']).sort_index()
display(train_rdy)

test['date'] = test['date'].dt.to_period('D')
test_rdy = test.set_index(['store_nbr', 'family', 'date']).sort_index()
display(test_rdy)

In [None]:
sdate = '2017-03-25'  # start of the training period
edate = '2017-08-15'  # end of the training period

y_arr = []
onpromotion_arr = []
test_onpromotion_arr = []

for nbr in stores.index:
    # y_arr
    temp = train_rdy.loc[str(nbr), 'sales']
    y_arr.append(temp.unstack(['family']).loc[sdate:edate])
    
    # onpromotion_arr
    onpromotion = train_rdy.loc[str(nbr), 'onpromotion']
    onpromotion = onpromotion.unstack(['family']).loc[sdate:edate].sum(axis=1)
    onpromotion.name = 'onpromotion'
    onpromotion_arr.append(onpromotion)
    
    # test_onpromotion_arr
    test_onpromotion = test_rdy.loc[str(nbr), 'onpromotion']
    test_onpromotion = test_onpromotion.unstack(['family']).sum(axis=1)
    test_onpromotion.name = 'onpromotion'
    test_onpromotion_arr.append(test_onpromotion)
    
    # onpromotion here is the total number of promoted items across all families in the current store
    
    
y_arr[1] # sales of store_nbr 2

## X

In [None]:
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

fourier = CalendarFourier(freq='W', order=4)
X_arr = []
X_test_arr = []
store_index = 1

for y, onpromotion, test_onpromotion in zip(y_arr, onpromotion_arr, test_onpromotion_arr): 
    dp = DeterministicProcess(index=y.index,
                              constant=False,
                              order=1,
                              seasonal=False,
                              additional_terms=[fourier],
                              drop=True)
    X = dp.in_sample()
    X_test = dp.out_of_sample(steps=16)
    
    # On promotion
    X = X.merge(onpromotion, how='left', left_index=True, right_index=True)
    X_test = X_test.merge(test_onpromotion, how='left', left_index=True, right_index=True)
    
    # Holidays
    X = X.merge(holidays_rdy, how='left', left_index=True, right_index=True)
    X_test = X_test.merge(holidays_rdy, how='left', left_index=True, right_index=True)
    
    store_state = stores.loc[store_index, 'state']
    store_city = stores.loc[store_index, 'city']
    
    # Apply local holidays
    for j in X.index: 
        if X.loc[j, 'locale_name'].find(store_state) != -1 or X.loc[j, 'locale_name'].find(store_city) != -1:
            X.loc[j, 'work_day'] = False
    
    for j in X_test.index: 
        if X_test.loc[j, 'locale_name'].find(store_state) != -1 or X_test.loc[j, 'locale_name'].find(store_city) != -1:
            X_test.loc[j, 'work_day'] = False
    
    X.drop(['locale_name'], axis=1, inplace=True)  
    X_test.drop(['locale_name'], axis=1, inplace=True)  
    
    # Oil
    X = X.merge(oil_rdy, how='left', left_index=True, right_index=True)
    X_test = X_test.merge(oil_rdy, how='left', left_index=True, right_index=True)
    
    X_arr.append(X)
    X_test_arr.append(X_test)
    
    store_index += 1
    
X_arr[0]

In [None]:
X_test_arr[0]
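
To give an intuition for the Fourier columns in `X`, here is a hand-rolled analogue of weekly sin/cos features. It is only a sketch: `weekly_fourier` is a hypothetical helper, and its column names and phase are not claimed to match what statsmodels produces.

```python
import numpy as np
import pandas as pd

def weekly_fourier(index: pd.DatetimeIndex, order: int) -> pd.DataFrame:
    """Hand-rolled sin/cos pairs with a 7-day period (illustrative helper)."""
    # Position within the week as a fraction in [0, 1)
    t = index.dayofweek.to_numpy() / 7.0
    features = {}
    for k in range(1, order + 1):
        features[f'sin_{k}'] = np.sin(2 * np.pi * k * t)
        features[f'cos_{k}'] = np.cos(2 * np.pi * k * t)
    return pd.DataFrame(features, index=index)

idx = pd.date_range('2017-08-14', periods=14, freq='D')
feats = weekly_fourier(idx, order=2)
print(feats.round(3))
```

The rows repeat every 7 days, which is what lets a linear model learn a smooth weekly pattern from a handful of columns.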

# Modelling

First of all, let's try a Ridge regressor and look at the results.

In [None]:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

ridge = make_pipeline(RobustScaler(),
                      Ridge(alpha=31.0))

t_errors = []
v_errors = []

# Collect errors for each store
for X, y in zip(X_arr, y_arr):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                      random_state=1, shuffle=False)
    model = ridge.fit(X_train, y_train)
    train_pred = pd.DataFrame(model.predict(X_train), index=X_train.index,
                              columns=y_train.columns).clip(0.0)
    val_pred = pd.DataFrame(model.predict(X_val), index=X_val.index,
                            columns=y_val.columns).clip(0.0)

    y_train = y_train.stack(['family']).reset_index()
    y_train['pred'] = train_pred.stack(['family']).reset_index().loc[:, 0]

    y_val = y_val.stack(['family']).reset_index()
    y_val['pred'] = val_pred.stack(['family']).reset_index().loc[:, 0]

    t_errors.append(y_train.groupby('family').apply(
        lambda r: mean_squared_log_error(r.loc[:, 0], r['pred'])))
    v_errors.append(y_val.groupby('family').apply(
        lambda r: mean_squared_log_error(r.loc[:, 0], r['pred'])))

In [None]:
# Sum of mean squared log error from validation dataset
sum(v_errors).sort_values(ascending=False)
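
For reference, the mean squared log error used above can be written by hand as the mean of (log(1 + y) - log(1 + prediction))². The sketch below (the `msle` helper and the numbers are made up for illustration) shows why the same absolute error hurts much more on a low-volume family.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared log error: mean((log(1 + y) - log(1 + y_hat))^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Same absolute error of 10, very different log error
small = msle([10.0], [0.0])      # low-volume family
large = msle([1000.0], [990.0])  # high-volume family
print(small, large)
```

This formula is also the reason the predictions are clipped at 0: it is meant for non-negative values only.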

Here we can see that the **SCHOOL AND OFFICE SUPPLIES** error is much higher than the others.  
We need to create a custom regressor that uses a different model for this family.

I took an existing custom regressor that already works well and tuned the parameters for Ridge and SVR.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
    
from joblib import Parallel, delayed    
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor, VotingRegressor

class CustomRegressor:
    def __init__(self, n_jobs: int = -1, seed: int = 1):
        self.n_jobs = n_jobs
        self.seed = seed
        self._estimators = None

    def _get_model(self, x_, y_):
        if y_.name == 'SCHOOL AND OFFICE SUPPLIES':
            etr = ExtraTreesRegressor(n_estimators=500, n_jobs=self.n_jobs,
                                      random_state=self.seed)
            rfr = RandomForestRegressor(n_estimators=500, n_jobs=self.n_jobs,
                                        random_state=self.seed)
            br1 = BaggingRegressor(estimator=etr, n_estimators=10,  # 'base_estimator' in scikit-learn < 1.2
                                   n_jobs=self.n_jobs, random_state=self.seed)
            br2 = BaggingRegressor(estimator=rfr, n_estimators=10,
                                   n_jobs=self.n_jobs, random_state=self.seed)
            model = VotingRegressor([('ExtraTrees', br1), ('RandomForest', br2)])
        else:
            ridge = make_pipeline(RobustScaler(),
                                  Ridge(alpha=31.0, random_state=self.seed))
            svr = make_pipeline(RobustScaler(),
                                SVR(C=1.68, epsilon=0.09, gamma=0.07))

            model = VotingRegressor([('ridge', ridge), ('svr', svr)])

        model.fit(x_, y_)
        return model

    def fit(self, x_, y_):
        self._estimators = Parallel(n_jobs=self.n_jobs, verbose=0) \
            (delayed(self._get_model)(x_, y_.iloc[:, i]) for i in range(y_.shape[1]))

    def predict(self, x_):
        y_pred = Parallel(n_jobs=self.n_jobs, verbose=0) \
            (delayed(e.predict)(x_) for e in self._estimators)

        return np.stack(y_pred, axis=1)

**Warning**: it will take a while

In [None]:
# Get fitted models
models = []
for X, y in zip(X_arr, y_arr):
    model = CustomRegressor()
    model.fit(X, y)
    models.append(model)

In [None]:
# Get predictions
results = []
for X_test, model, y in zip(X_test_arr, models, y_arr):
    y_pred = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns).clip(0.0)
    results.append(y_pred.stack(['family']))
    
results[0]

# Submission
To create the submission we need to concatenate all predictions into one dataframe.  
**Note:** originally the data was sorted by store_nbr as a string, so it looked like this: 1, 10, 11, 12, ... 2, 20, 21, 22, ...  
To concatenate the predictions correctly, we need to reproduce that order.
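
A small illustration of the string ordering, using 12 made-up store numbers instead of 54:

```python
# Store numbers 1..12 sorted as strings, then mapped back to list positions
store_ids = list(range(1, 13))
as_strings = sorted(str(n) for n in store_ids)
positions = [int(s) - 1 for s in as_strings]
print(as_strings)
print(positions)
```

Here `positions` plays the role of the indexes into a list of per-store results that was built in numeric store order.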


In [None]:
# Get correct dates for submission
dates = ['2017-08-16', '2017-08-17', '2017-08-18', '2017-08-19', '2017-08-20', '2017-08-21',
         '2017-08-22', '2017-08-23', '2017-08-24', '2017-08-25', '2017-08-26', '2017-08-27',
         '2017-08-28', '2017-08-29', '2017-08-30', '2017-08-31']

# Get correct order for submission
order = list(range(1, len(results) + 1))
str_map = map(str, order)
correct_order_str = sorted(list(str_map))
int_minus_one = lambda element: int(element) - 1
correct_order_int = list(map(int_minus_one, correct_order_str))

# Create and fill list with predictions in the correct order
data = []
for date in dates:
    for i in correct_order_int:
        data += results[i].loc[date].to_list()

# Create dataframe from the list
result = pd.DataFrame(data, columns = ['sales'])

# We can use sample_submission.csv to make submission
submission = pd.read_csv(PATH + 'sample_submission.csv')
submission['sales'] = result['sales']
submission

In [None]:
# Save submission
submission.to_csv('submission.csv', index = False)

# Conclusion
In conclusion, I want to say thank you to the Kaggle community; you guys are doing very cool stuff.  
All links to the resources I used can be found at the top of the notebook.

**If you find any errors in this notebook, please let me know.**