#### What are you trying to do in this notebook?
This notebook uses a dataset with the dates of festivities in Finland, Norway and Sweden for the years covered by the competition.
We will be predicting a full year's worth of sales for three items at two stores located in three different countries. This dataset is completely fictional, but contains many effects you see in real-world data, e.g., weekend and holiday effects, seasonality, etc. The dataset is small enough to allow us to try numerous different modeling approaches.

#### Why are you trying it?
- To check the data structure. 
- To handle the combination of time series defined by countries, stores and products.
- To check that all the combinations appear in train and test.
- To visualize the timing split between train and test.
- To verify that no date is missing from train and test.

#### What we learned while making this notebook?
- To explore the data.
- To create panels of products, countries and shops.
- To explore seasonality based on months.
- To examine seasonality at a week level.
- To observe recurrences at a monthly level.
- To enrich the data (using festivities and GDP data), etc.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

In [None]:
# Loading train and test data
train = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv", parse_dates=['date'])
test = pd.read_csv("../input/tabular-playground-series-jan-2022/test.csv", parse_dates=['date'])

In [None]:
train.dtypes

In [None]:
# Compute the number of theoretically possible country/store/product combinations
time_series = ['country', 'store', 'product']
combinations = 1
for feat in time_series:
    combinations *= train[feat].nunique()
    
print(f"There are {combinations} possible combinations")

In [None]:
time_series = ['country', 'store', 'product']
# Reset the index so the comparison depends only on the values, not on the original row labels
country_store_product_train = train[time_series].drop_duplicates().sort_values(time_series).reset_index(drop=True)
country_store_product_test = test[time_series].drop_duplicates().sort_values(time_series).reset_index(drop=True)

cond_1 = len(country_store_product_train) == combinations
print(f"Are all theoretical combinations present in train: {cond_1}")
# .equals() returns False (instead of raising) when the two frames differ in shape
cond_2 = country_store_product_train.equals(country_store_product_test)
print(f"Are combinations the same in train and test: {cond_2}")

In [None]:
train_dates = train.date.drop_duplicates().sort_values()
test_dates = test.date.drop_duplicates().sort_values()

fig, ax = plt.subplots(1, 1, figsize = (11, 7))
cmap_cv = plt.cm.coolwarm

color_index = np.array([1] * len(train_dates) + [0] * len(test_dates))

ax.scatter(range(len(train_dates)), [.5] * len(train_dates),
           c=color_index[:len(train_dates)], marker='_', lw=15, cmap=cmap_cv,
           label='train', vmin=-.2, vmax=1.2)

ax.scatter(range(len(train_dates), len(train_dates) + len(test_dates)), [.55] * len(test_dates),
           c=color_index[len(train_dates):], marker='_', lw=15, cmap=cmap_cv,
           label='test', vmin=-.2, vmax=1.2)

tick_locations = np.cumsum([0, 365, 366, 365, 365, 365])
for i in (tick_locations):
    ax.vlines(i, 0, 2,linestyles='dotted', colors = 'grey')
    
ax.set_xticks(tick_locations)
ax.set_xticklabels([2015, 2016, 2017, 2018, 2019, 2020], rotation = 0)
ax.set_yticklabels(labels=[])
plt.ylim([0.45, 0.60])
ax.legend(loc="upper left", title="data")

plt.show()

In [None]:
missing_train = pd.date_range(start=train_dates.min(), end=train_dates.max()).difference(train_dates)
missing_test = pd.date_range(start=test_dates.min(), end=test_dates.max()).difference(test_dates)
print(f"missing dates in train: {len(missing_train)} and in test: {len(missing_test)}")

In [None]:
# Create features at different time granularities

def process_time(df):
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['week'] = df['date'].dt.isocalendar().week
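    # ISO week 53 occurs only in some years, so fold it into week 52
    # (.loc avoids chained assignment, which may not modify df)
    df.loc[df['week'] > 52, 'week'] = 52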
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['dayofyear'] = df['date'].dt.dayofyear
    return df

train = process_time(train)
test = process_time(test)
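
The clipping of week numbers in `process_time` matters because ISO week numbering occasionally produces a week 53, and dates in early January can still belong to the last ISO week of the previous year. The following is a minimal sketch on four hand-picked dates around the 2015/2016 boundary; the names `dates`, `iso_week` and `clipped_week` are illustrative and not part of the notebook.

```python
import pandas as pd

# Dates around the 2015/2016 boundary; ISO year 2015 has 53 weeks.
dates = pd.Series(pd.to_datetime(
    ["2015-12-27", "2015-12-31", "2016-01-01", "2016-01-04"]))

# ISO week number, converted from the nullable UInt32 dtype to plain int
iso_week = dates.dt.isocalendar().week.astype(int)

# Same effect as the clipping in process_time: week 53 is folded into week 52
clipped_week = iso_week.clip(upper=52)

print(pd.DataFrame({"date": dates,
                    "iso_week": iso_week,
                    "week_clipped": clipped_week}))
```

Note that 2016-01-01 falls in ISO week 53 even though its month is January, so the `week` and `month` features can disagree near year boundaries.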

In [None]:
for product in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
    print(f"\n--- {product} ---\n")
    fig = plt.figure(figsize=(20, 10), dpi=100)
    fig.subplots_adjust(hspace=0.25)
    for i, store in enumerate(['KaggleMart', 'KaggleRama']):
        for j, country in enumerate(['Finland', 'Norway', 'Sweden']):
            ax = fig.add_subplot(2, 3, (i*3+j+1))
            selection = (train['country']==country)&(train['store']==store)&(train['product']==product)
            selected = train[selection]
            selected.set_index('date').groupby('year')['num_sold'].mean().plot(ax=ax)
            ax.set_title(f"{country}:{store}")
    plt.show()

In [None]:
for product in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
    fig = plt.figure(figsize=(20, 10), dpi=100)
    fig.subplots_adjust(hspace=0.25)
    for i, store in enumerate(['KaggleMart', 'KaggleRama']):
        for j, country in enumerate(['Finland', 'Norway', 'Sweden']):
            ax = fig.add_subplot(2, 3, (i*3+j+1))
            selection = (train['country']==country)&(train['store']==store)&(train['product']==product)
            selected = train[selection]
            for year in [2015, 2016, 2017, 2018]:
                selected[selected.year==year].set_index('date').groupby('month')['num_sold'].mean().plot(ax=ax, label=year)
            ax.set_title(f"{product} | {country}:{store}")
            ax.legend()
    plt.show()

In [None]:
for product in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
    print(f"\n--- {product} ---\n")
    fig = plt.figure(figsize=(20, 10), dpi=100)
    fig.subplots_adjust(hspace=0.25)
    for i, store in enumerate(['KaggleMart', 'KaggleRama']):
        for j, country in enumerate(['Finland', 'Norway', 'Sweden']):
            ax = fig.add_subplot(2, 3, (i*3+j+1))
            selection = (train['country']==country)&(train['store']==store)&(train['product']==product)
            selected = train[selection]
            for year in [2015, 2016, 2017, 2018]:
                selected[selected.year==year].set_index('date').groupby('week')['num_sold'].mean().plot(ax=ax, label=year)
            ax.set_title(f"{country}:{store}")
            ax.legend()
    plt.show()

In [None]:
for product in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
    print(f"\n--- {product} ---\n")
    fig = plt.figure(figsize=(20, 10), dpi=100)
    fig.subplots_adjust(hspace=0.25)
    for i, store in enumerate(['KaggleMart', 'KaggleRama']):
        for j, country in enumerate(['Finland', 'Norway', 'Sweden']):
            ax = fig.add_subplot(2, 3, (i*3+j+1))
            selection = (train['country']==country)&(train['store']==store)&(train['product']==product)
            selected = train[selection]
            for year in [2015, 2016, 2017, 2018]:
                selected[selected.year==year].set_index('date').groupby('day')['num_sold'].mean().plot(ax=ax, label=year)
            ax.set_title(f"{country}:{store}")
            ax.legend()
    plt.show()

In [None]:
for product in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
    print(f"\n--- {product} ---\n")
    fig = plt.figure(figsize=(20, 10), dpi=100)
    fig.subplots_adjust(hspace=0.25)
    for i, store in enumerate(['KaggleMart', 'KaggleRama']):
        for j, country in enumerate(['Finland', 'Norway', 'Sweden']):
            ax = fig.add_subplot(2, 3, (i*3+j+1))
            selection = (train['country']==country)&(train['store']==store)&(train['product']==product)
            selected = train[selection]
            for year in [2015, 2016, 2017, 2018]:
                selected[selected.year==year].set_index('date').groupby('dayofweek')['num_sold'].sum().plot(ax=ax, label=year)
            ax.set_title(f"{country}:{store}")
            ax.legend()
    plt.show()

In [None]:
festivities = pd.read_csv("../input/festivities-in-finland-norway-sweden-tsp-0122/nordic_holidays.csv",
                          parse_dates=['date'],
                          usecols=['date', 'country', 'holiday'])

In [None]:
gdp = pd.read_csv("../input/gdp-20152019-finland-norway-and-sweden/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv")
gdp = np.concatenate([gdp[['year', 'GDP_Finland']].values, 
                      gdp[['year', 'GDP_Norway']].values, 
                      gdp[['year', 'GDP_Sweden']].values])
gdp = pd.DataFrame(gdp, columns=['year', 'gdp'])
gdp['country'] = ['Finland']*5 + ['Norway']*5 +['Sweden']*5
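
The wide-to-long reshape above can also be written with `DataFrame.melt`, which avoids hard-coding the number of years per country and keeps `year` as an integer column. This is only a sketch of an alternative: the column names follow the CSV used above, but the numbers are placeholders, not real GDP figures, and `gdp_wide` / `gdp_long` are illustrative names.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real GDP figures.
gdp_wide = pd.DataFrame({
    "year": [2015, 2016],
    "GDP_Finland": [1.0, 1.1],
    "GDP_Norway": [2.0, 2.1],
    "GDP_Sweden": [3.0, 3.1],
})

# One row per (year, country) pair
gdp_long = gdp_wide.melt(id_vars="year", var_name="country", value_name="gdp")

# Strip the "GDP_" prefix so the values match the country names in train/test
gdp_long["country"] = gdp_long["country"].str.replace("GDP_", "", regex=False)

print(gdp_long)
```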

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def process_data(df):
    
    processed = dict()
    processed['row_id'] = df['row_id']
    
    print("creating dummies for main effects of time, country, store and product")
    to_dummies = ['country', 'store', 'product', 'month', 'week', 'day', 'dayofweek']
    for feat in to_dummies:
        tmp = pd.get_dummies(df[feat])
        for col in tmp.columns:
            processed[feat+'_'+str(col)] = tmp[col]
    
    print("creating dummies with a 7-day halo effect for Nordic holidays")
    tmp = pd.get_dummies(
        df.merge(festivities, on=['date', 'country'], how='left').sort_values('row_id')['holiday'])
    for col in tmp.columns:
        # holiday dummy plus its trailing 7-row rolling mean
        peak = tmp[col].values + tmp[col].rolling(7).mean().fillna(0).values
        processed['holiday_'+str(col)] = peak
    
    print("creating interactions")
    high_lvl_interactions = [
        ['country', 'product', 'month'],
        ['country', 'product', 'week'],
        ['country', 'store', 'week'],
        ['country', 'product', 'month', 'day'],
        ['country', 'product', 'month', 'dayofweek'],
    ]
    for sel in high_lvl_interactions:
        tmp = pd.get_dummies(df[sel].apply(lambda row: '_'.join(row.values.astype(str)), axis=1))
        for col in tmp.columns:
            processed[col] = tmp[col]
            
    print("modelling time as continuous per each country")
    for country in ['Finland', 'Norway', 'Sweden']:
        processed[country + '_prog'] = ((df.row_id // 18) + 1) * (df['country']==country).astype(int)
        processed[country + '_prog^2'] = (processed[country + '_prog']**2)
        processed[country + '_prog^3'] = (processed[country + '_prog']**3)
        
    print("adding gdp")
    gdp_countries = df.merge(gdp, on=['country', 'year'], how='left')['gdp'].values
    for country in ['Finland', 'Norway', 'Sweden']:
        processed['gdp_'+ country] = gdp_countries * (df['country']==country).astype(int)
            
    print(f"completed processing {len(processed)-1} features")
    
    values = list()
    columns = list()
    for key, value in processed.items():
        values.append(np.array(value))
        columns.append(key)
        
    values = np.array(values).T        
    processed = pd.DataFrame(values, columns=columns)
    
    print("resorting row ids")
    processed = processed.sort_values('row_id').set_index('row_id')
    return processed

def process_target(df):
    target = pd.DataFrame({'row_id':df['row_id'], 'num_sold':df['num_sold']})
    target = target.sort_values('row_id').set_index('row_id')
    return target

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows in the same way
train_test = process_data(pd.concat([train, test]))

processed_train = train_test.iloc[:len(train)].copy()
processed_test = train_test.iloc[len(train):].copy()

target = np.ravel(process_target(train))    
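
To see what the holiday "halo" formula in `process_data` produces, here is a small sketch on a single daily series with one holiday. It assumes one row per day, which is a simplification: `rolling(7)` counts rows, and the stacked train/test frame has 18 rows per date, so there the window spans 7 rows rather than 7 days. The series `indicator` is made up for illustration.

```python
import pandas as pd

# Sixteen consecutive days of one series, with a single holiday on day index 8.
indicator = pd.Series([0.0] * 16)
indicator.iloc[8] = 1.0

# Same formula as in process_data: the dummy plus its trailing 7-row mean.
# The first six rows have an incomplete window (NaN), replaced by 0.
halo = indicator + indicator.rolling(7).mean().fillna(0)

print(halo.round(3).tolist())
```

The holiday itself gets 1 + 1/7, the six following days get 1/7 each, and the days before the holiday stay at 0, so the halo only trails the holiday and never leads it.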

In [None]:
def weighting(df, weights):
    return df.year.replace(weights).values
    
weights = weighting(train, {2015:0.125, 2016:0.25, 2017:0.5, 2018:1})
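
The weight dictionary above halves the weight for each year further back from 2018. The same values can be generated from a formula, which is handy if the decay rate needs tuning. This is a sketch only: `years` and `weights_demo` are illustrative names, and how the weights are consumed later (for example as per-row sample weights when fitting a model) is not shown in this notebook.

```python
import pandas as pd

# One entry per training year, for illustration
years = pd.Series([2015, 2016, 2017, 2018])

# Halve the weight for every year of distance from the most recent year
weights_demo = 0.5 ** (years.max() - years)

print(weights_demo.tolist())
```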

In [None]:
def SMAPE(y_true, y_pred):
    # From https://www.kaggle.com/cpmpml/smape-weirdness
    denominator = (y_true + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return np.mean(diff)

def SMAPE_exp(y_true, y_pred):
    y_true = np.exp(y_true)
    y_pred = np.exp(y_pred)
    denominator = (y_true + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return np.mean(diff)

def SMAPE_err(y_true, y_pred):
    # From https://www.kaggle.com/cpmpml/smape-weirdness
    denominator = (y_true + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return diff
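
A short worked example may help when reading SMAPE scores. The sketch below re-implements the same formula as `SMAPE` under the illustrative name `smape`, so that it runs on its own, and evaluates it on made-up numbers. It also shows that, for the same absolute error, an under-forecast is penalized more heavily than an over-forecast.

```python
import numpy as np

def smape(y_true, y_pred):
    """SMAPE in percent, using the same formula as the SMAPE function above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (y_true + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0
    return np.mean(diff)

# Two observations: errors of 10 on 100 and 20 on 200
score = smape([100, 200], [110, 180])

# Same absolute error of 50, once above and once below the true value
over = smape([100], [150])   # 200 * 50 / 250
under = smape([100], [50])   # 200 * 50 / 150

print(score, over, under)
```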

In [None]:
# Because of floating-point arithmetic, the log/exp result is only approximately equal to the direct product
a = 5
b = 7
print(a * b) # pure multiplicative
print(np.exp(np.log(a) + np.log(b))) # multiplicative made additive by log

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-jan-2022/sample_submission.csv")
submission.to_csv("submission.csv", index=False)

#### Did it work?
This notebook goes together with the EDA notebook, which visualizes the various seasonal effects and the differences in growth rate. Scikit-learn doesn't offer SMAPE as a loss function. As a workaround, I train with Huber loss on a transformed target and apply a correction factor; we'll see how far that gets us.

The transformed target for the regression is the log of the sales numbers.
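
To illustrate why a log target suits this kind of data, here is a minimal sketch on made-up multiplicative sales (a base level times a weekend uplift). It uses ordinary least squares from NumPy instead of Huber loss and leaves out the correction factor, since neither is specified in this notebook; all names and numbers are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy data: four weeks, sales = base level * weekend uplift
is_weekend = np.array([0, 0, 0, 0, 0, 1, 1] * 4, dtype=float)
sales = 100.0 * np.where(is_weekend == 1, 1.5, 1.0)

# In log space the multiplicative uplift becomes an additive coefficient
X = np.column_stack([np.ones_like(is_weekend), is_weekend])
coef, *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)

# Back-transform predictions and coefficients to the sales scale
pred = np.exp(X @ coef)
base, uplift = np.exp(coef)

print(base, uplift)
```

Exponentiating the fitted coefficients recovers the base level and the weekend multiplier, which is the reason a linear model on log sales can represent multiplicative effects.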

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems working with it. If anything I do in this notebook is unclear, please leave a comment on the notebook.

#### What else do you think you can try as part of this approach? 
Look at a notebook which presents feature engineering (based on the insights of this EDA) and a linear model which makes use of the features.

Credit goes to:
- Festivities in Finland, Norway, Sweden (TSP 01-22)
- GDP 2015-2019: Finland, Norway, and Sweden

**Thanks for these datasets!**