The effect of holidays looks complicated, but it is actually a simple superposition of a single kernel (plus an additional one for Christmas). This notebook presents the 5th-place solution with some visualizations.

Minimum linear regression (5th-place):
https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/304369

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime

from sklearn.linear_model import LinearRegression

In [None]:
di = '/kaggle/input/tabular-playground-series-jan-2022/'
train = pd.read_csv(di + 'train.csv')
test = pd.read_csv(di + 'test.csv')
submit = pd.read_csv(di + 'sample_submission.csv')

# Holidays

Official holidays differ by country, and there are minor differences around the big holidays, such as Maundy Thursday at Easter and Christmas Eve.

From Holiday data by DrCapa:
https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298243

In [None]:
import io

holiday_table = pd.read_csv(io.StringIO(
'''
name,2014,2015,2016,2017,2018,2019,2020,Finland,Norway,Sweden,Fixed
New Year's Day,,01-01,01-01,01-01,01-01,01-01,01-01,✔︎,✔︎,✔︎,✔︎
Epiphany,,01-06,01-06,01-06,01-06,01-06,01-06,✔︎,,✔︎,✔︎
Maundy Thursday,,04-02,03-24,04-13,03-29,04-18,,,✔︎,,
Good Friday,,04-03,03-25,04-14,03-30,04-19,,✔︎,✔︎,✔︎,
Easter Sunday,,04-05,03-27,04-16,04-01,04-21,,✔︎,✔︎,✔︎,
Easter Monday,,04-06,03-28,04-17,04-02,04-22,,✔︎,✔︎,✔︎,
May Day,,05-01,05-01,05-01,05-01,05-01,,✔︎,✔︎,✔︎,✔︎
Ascension Day,,05-14,05-05,05-25,05-10,05-30,,✔︎,✔︎,✔︎,
Constitution Day,,05-17,05-17,05-17,05-17,05-17,,,✔︎,,✔︎
Pentecost,,05-24,05-15,06-04,05-20,06-09,,✔︎,✔︎,✔︎,
Whit Monday,,05-25,05-16,06-05,05-21,06-10,,,✔︎,,
National Day of Sweden,,06-06,06-06,06-06,06-06,06-06,,,,✔︎,✔︎
Midsummer Eve,,06-19,06-24,06-23,06-22,06-21,,✔︎,,✔︎,
Midsummer Day,,06-20,06-25,06-24,06-23,06-22,,✔︎,,✔︎,
All Saints' Day,,10-31,11-05,11-04,11-03,11-02,,✔︎,,✔︎,
Independence Day,,12-06,12-06,12-06,12-06,12-06,,✔︎,,,✔︎
Christmas Eve,12-24,12-24,12-24,12-24,12-24,12-24,,✔︎,,✔︎,✔︎
Christmas Day,12-25,12-25,12-25,12-25,12-25,12-25,,✔︎,✔︎,✔︎,✔︎
St. Stephen's Day,12-26,12-26,12-26,12-26,12-26,12-26,,✔︎,✔︎,✔︎,✔︎
New Year's Eve,12-31,12-31,12-31,12-31,12-31,12-31,,,,✔︎,✔︎
''')).fillna('')

In [None]:
holiday_table

In [None]:
def create_holidays(df):
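    """
    Return a list of dicts, one per holiday in holiday_table:
      name (str): holiday name
      dt (array of int): signed days from the nearest occurrence of the
                         holiday to each date in df (t - t_holiday)
      country (np.array): 0/1 flags for Finland, Norway, Sweden
    """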
    holidays = []
    
    # Unique dates (18 rows per date), e.g., 1461 days in train
    t = pd.to_datetime(df['date'].values.reshape(-1, 18)[:, 0])

    # For each of the 20 official holidays in the table presented above
    for i, r in holiday_table.iterrows():
        dt = None
    
        for year in range(2014, 2021):
            md = r[str(year)]

            if md == '':
                continue

            # Days relative to the holiday
            t0 = pd.to_datetime('%d-%s' % (year, md))
            dt_this_year = (t - t0).days
            
            # Select smallest dt in abs()
            if dt is None:
                dt = dt_this_year
            else:
                dt = np.where(np.abs(dt) < np.abs(dt_this_year), dt, dt_this_year)    

        # 0 or 1 if the holiday applies to each country
        country = np.array(r[['Finland', 'Norway', 'Sweden']].values == '✔︎', dtype=int)

        d = {'name': r['name'],
             'dt': dt,            # time difference in days
             'country': country,  # binary flag per country (Finland, Norway, Sweden)
            }
    
        holidays.append(d)

    return holidays
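
The "nearest occurrence" logic in `create_holidays` can be illustrated in isolation. The following is a minimal sketch using NumPy dates instead of pandas; the variable names (`occurrences`, `nearest`) are hypothetical and the vectorized `argmin` is a stand-in for the year loop above, not the notebook's own code.

```python
import numpy as np

# Hypothetical example: signed days relative to the nearest Christmas Day
dates = np.arange('2015-12-20', '2016-01-05', dtype='datetime64[D]')
occurrences = np.array(['2015-12-25', '2016-12-25'], dtype='datetime64[D]')

# Signed offsets to every occurrence, shape (n_dates, n_occurrences)
offsets = (dates[:, None] - occurrences[None, :]).astype(int)

# For each date keep the offset with the smallest absolute value
nearest = offsets[np.arange(len(dates)), np.abs(offsets).argmin(axis=1)]

print(nearest)
```

Negative values are days before the holiday, zero is the holiday itself, and positive values are days after it, which is what the holiday flags later select on.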

## GDP

Thanks to Carl McBride Ellis for discovering and sharing: https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298911

In [None]:
countries = train['country'].unique()

def add_gdp(df):
    """
    Add GDP column to train/test
    """
    g = pd.read_csv('/kaggle/input/gdp-20152019-finland-norway-and-sweden/'
                    'GDP_data_2015_to_2019_Finland_Norway_Sweden.csv')
    
    # Convert to year, country, GDP format for pd.merge
    dfs = []
    for country in countries:
        dfs.append(pd.DataFrame({'year': g['year'],
                                 'country': [country, ] * len(g),
                                 'GDP': g['GDP_' + country]}))
    gdp = pd.concat(dfs)

    df['datetime'] = pd.to_datetime(df['date'])
    df['year'] = df['datetime'].dt.year    
    df = pd.merge(df, gdp, how='left', on=['year', 'country'])
    
    return df
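
The wide-to-long conversion in `add_gdp` could also be written with `DataFrame.melt`. This is a sketch under assumptions: only the column layout (`year`, `GDP_<country>`) is taken from the code above, and the numbers are placeholders, not real GDP figures.

```python
import pandas as pd

# Placeholder values, NOT real GDP data; only the column names follow add_gdp
g = pd.DataFrame({'year': [2015, 2016],
                  'GDP_Finland': [1.0, 2.0],
                  'GDP_Norway': [3.0, 4.0],
                  'GDP_Sweden': [5.0, 6.0]})

# One row per (year, country) so that it can be merged on both keys
gdp = g.melt(id_vars='year', var_name='country', value_name='GDP')
gdp['country'] = gdp['country'].str.replace('GDP_', '', regex=False)

# Toy sales rows standing in for train/test
sales = pd.DataFrame({'date': ['2015-03-01', '2016-07-15'],
                      'country': ['Norway', 'Finland']})
sales['year'] = pd.to_datetime(sales['date']).dt.year
merged = sales.merge(gdp, how='left', on=['year', 'country'])

print(merged)
```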

# Features

1. Basic features from AmbrosM: https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model
   * 7 features for weekday, country, store, and product
2. Fourier series reduced to a pure cosine for Mug and a pure sine for Hat; 2 features
3. Holiday boost for 10 days after the holiday
   * 10 binary flags for "today is n (0 ≦ n < 10) days after holiday"
   * Similar flag for 10 days after Christmas
   * Linear coefficients (kernel) × features give the superpositions of the kernel
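
The last point can be checked numerically. In this minimal sketch the kernel values and the holiday positions are made up for illustration; it only shows that multiplying the "n days after holiday" flags by a coefficient vector equals a sum of shifted copies of that vector.

```python
import numpy as np

days = 10   # kernel length, as in the feature list above
n = 30      # number of days in this toy series

# Hypothetical kernel values standing in for fitted coefficients
kernel = np.exp(-((np.arange(days) - 4.0) / 2.0) ** 2)

holiday_index = [5, 8]  # two made-up holidays, three days apart

# flags[t, i] = number of holidays that fell exactly i days before day t
flags = np.zeros((n, days))
for t0 in holiday_index:
    for i in range(days):
        flags[t0 + i, i] += 1

# Linear model contribution: features times coefficients
effect = flags @ kernel

# The same quantity written as a sum of shifted kernel copies
expected = np.zeros(n)
for t0 in holiday_index:
    expected[t0:t0 + days] += kernel

print(np.allclose(effect, expected))
```

Because the flags are counts rather than booleans, overlapping holidays add up instead of saturating.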

In [None]:
def create_features(df):
    # Features
    X = pd.DataFrame()
    
    # Target variable: y = log(num_sold/GDP)   
    df = add_gdp(df)    
 
    if 'num_sold' in df.columns:
        y = np.log(df['num_sold'] / df['GDP']).values
    else:
        y = None
    
    # Weekday
    wd = df['datetime'].dt.weekday
    X['friday'] = wd == 4
    X['weekend'] = wd >= 5
    
    # Country factor
    X['Norway'] = df['country'] == 'Norway'
    X['Sweden'] = df['country'] == 'Sweden'

    # Product factor
    X['Hat'] = df['product'] == 'Kaggle Hat'
    X['Sticker'] = df['product'] == 'Kaggle Sticker'

    # Store factor
    X['Rama'] = df['store'] == 'KaggleRama'

    X = X.astype(float)
    
    # Sinusoidal
    date = df['date'].values.reshape(-1, 18)[:, 0]  # 1461 dates (train)
    dt = pd.to_datetime(date)
    year = dt.year.values
    
    n = len(date)
    daymax = 365 * np.ones(n)
    daymax[year % 4 == 0] = 366
    
    theta = 2.0 * math.pi * (dt.dayofyear.values - 1) / daymax  # phase in [0, 2pi)

    # time x country x store x product
    cos = np.zeros((n, 3, 2, 3))  
    cos[:, :, :, 0] = np.cos(theta).reshape(-1, 1, 1)  # cos for product 0
    X['cos'] = cos.flatten()

    sin = np.zeros((n, 3, 2, 3))
    sin[:, :, :, 1] = np.sin(theta).reshape(-1, 1, 1)  # sin for product 1
    X['sin'] = sin.flatten()

    # Holidays
    holidays = create_holidays(df)
    days = 10

    # Holiday flag for ith day after holiday
    for i in range(days):
        holiday = np.zeros((n, 3), dtype=int)  # date x country

        for h in holidays:
            # dt is days after the given holidays, t - t_holiday
            holiday += (h['dt'].reshape(n, 1) == i) * h['country']

        # Same for 2 store x 3 products
        X['holiday-%d' % i] = holiday.flatten().repeat(6)

    # Additional flag for Christmas
    h = holidays[-3]

    for i in range(days):
        holiday = h['dt'] == i
        X['christmas-%d' % i] = holiday.flatten().repeat(18)

    return X, y

X, y = create_features(train)
X.columns
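
The seasonal phase used inside `create_features` can be checked on its own. A minimal sketch with the standard library only; `seasonal_phase` is a hypothetical helper, not part of the notebook, and it uses `calendar.isleap` in place of the `year % 4` shortcut (the two agree for 2015 to 2019).

```python
import calendar
import datetime
import math

def seasonal_phase(d):
    """Phase in [0, 2*pi) for a datetime.date, 0 on 1 January."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    day_of_year = d.timetuple().tm_yday  # 1-based
    return 2.0 * math.pi * (day_of_year - 1) / days_in_year

print(seasonal_phase(datetime.date(2015, 1, 1)))
print(seasonal_phase(datetime.date(2016, 12, 31)))
```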

# Fit

29 coefficients and 1 bias.

In [None]:
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

In [None]:
def smapey(y_true, y_pred):
    """
    SMAPE(num_sold_true, num_sold_pred) from y = log(num_sold/GDP)
    """
    num_ratio = np.exp(y_pred - y_true)  # = num_sold_pred / num_sold_true
    return 200 * np.mean(np.abs(num_ratio - 1) / (num_ratio + 1))

smapey(y, y_pred)
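
The reason `smapey` needs no GDP is that GDP cancels in the ratio of predicted to true sales. The sketch below compares it against a direct SMAPE on made-up numbers; `smape` and `smape_from_log` are illustrative names, and the `gdp` values are arbitrary positive scales.

```python
import numpy as np

def smape(true, pred):
    """SMAPE in percent for positive values."""
    return 200 * np.mean(np.abs(pred - true) / (pred + true))

def smape_from_log(y_true, y_pred):
    """Same metric computed from y = log(num_sold / GDP)."""
    ratio = np.exp(y_pred - y_true)  # = pred / true, GDP cancels
    return 200 * np.mean(np.abs(ratio - 1) / (ratio + 1))

# Made-up sales and arbitrary positive scale factors
num_true = np.array([100.0, 250.0, 40.0])
num_pred = np.array([110.0, 240.0, 50.0])
gdp = np.array([2.0, 3.0, 4.0])

direct = smape(num_true, num_pred)
via_log = smape_from_log(np.log(num_true / gdp), np.log(num_pred / gdp))
print(direct, via_log)
```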

In [None]:
y_residual = y - y_pred

plt.figure(figsize=(12, 3))
plt.title('Residual averaged over 18 categories')
plt.xlabel('time [day]')
plt.ylabel('y_residual')
plt.plot(np.mean(y_residual.reshape(-1, 18), axis=1))
plt.axhline(0, color='gray', alpha=0.5)
plt.show()

# Plotting holidays


## Kernel
The fitted holiday kernel is well approximated by a Gaussian.

In [None]:
from scipy.optimize import curve_fit

def gauss(x, A, m, s):
    return A*np.exp(-((x - m) / s)** 2)

x = np.arange(10)
coef = model.coef_[9:19]
A = np.max(coef)
popt, _ = curve_fit(gauss, x, coef, p0=[A, 5, 2])
print('Gaussian fit: A, μ, σ =', popt)

xx = np.linspace(-1, 10, 101)
plt.title('Holiday kernel')
plt.plot(model.coef_[9:19], 'x', label='Holiday coef')
plt.plot(xx, gauss(xx, *popt), label='Gaussian')
plt.xlabel('days after holiday')
plt.ylabel('coef[9:19]')
plt.legend(frameon=False)
plt.show()

# Christmas
coef = model.coef_[19:29]
poptc, _ = curve_fit(gauss, x, coef, p0=[A, 5, 2])
print('Gaussian fit: A, μ, σ =', poptc)

plt.title('Christmas kernel')
plt.plot(coef, 'x', label='Christmas coef')
plt.plot(xx, gauss(xx, *poptc), label='Gaussian')
plt.xlabel('days after holiday')
plt.ylabel('coef[19:29]')
plt.legend(frameon=False)
plt.show()
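
One detail when reading the printed fit parameters: `gauss` has no factor 1/2 in the exponent, so its width parameter `s` is √2 times the standard deviation of the usual Gaussian form. The sketch below checks this with arbitrary parameter values; `gauss_std` is a hypothetical helper added for comparison only.

```python
import numpy as np

def gauss(x, A, m, s):
    # Same form as in the notebook: no 1/2 in the exponent
    return A * np.exp(-((x - m) / s) ** 2)

def gauss_std(x, A, m, sigma):
    # Conventional form with standard deviation sigma
    return A * np.exp(-0.5 * ((x - m) / sigma) ** 2)

x = np.linspace(-1, 10, 101)
A, m, s = 1.5, 4.0, 2.0      # arbitrary values, not fitted results
sigma = s / np.sqrt(2)       # conversion between the two conventions

print(np.allclose(gauss(x, A, m, s), gauss_std(x, A, m, sigma)))
```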

## Christmas

In [None]:
holidays = create_holidays(train)
cmap = plt.get_cmap('tab10')

# Subtract the 9 basic factors and the intercept, leaving the holiday effects
yy = y - np.matmul(X.values[:, :9], model.coef_[:9]) - model.intercept_
yy = np.mean(yy.reshape(-1, 3, 6), axis=2)  # average over 2 stores x 3 products

y_holiday = np.matmul(X.values[:, 9:], model.coef_[9:]).reshape(-1, 3, 6)

plt.figure(figsize=(8, 4))
plt.title('2015 Christmas to 2016 New Year')
plt.ylabel('y')
plt.xlabel('time [day]')

x = np.arange(350, 380)
for i, country in enumerate(countries):
    plt.plot(x, yy[350:380, i], 'x', color=cmap(i))
    plt.plot(x, y_holiday[350:380, i, 0], color=cmap(i), label=country)

plt.legend(frameon=False)
plt.show()

plt.figure(figsize=(8, 8))
xx = np.linspace(350, 379, 101)
for i, country in enumerate(countries):
    plt.subplot(3, 1, i + 1)
    plt.ylabel(country)
    plt.plot(x, yy[350:380, i], 'x', color=cmap(i))
    #plt.plot(x, y_holiday[350:380, i, 0], color=cmap(i), label=country)
    y_sum = np.zeros(101)
    
    for h in holidays:
        dt = h['dt'][350:380]
        
        if h['country'][i] and np.min(np.abs(dt)) < 10:
            dt = np.linspace(dt[0], dt[-1], 101)
            y_gauss = gauss(dt, *popt)
            if h['name'] == 'Christmas Day':
                y_gauss += gauss(dt, *poptc)
            plt.plot(xx, y_gauss, color=cmap(i))
            y_sum += y_gauss
        
    plt.plot(xx, y_sum, color=cmap(i), alpha=0.5, label='sum of holidays')
    if i == 0:
        plt.legend(frameon=False)

The differences among countries can be explained by the differences in official holidays, e.g.,

- Norway does not have Christmas Eve and has a smaller peak,
- Finland and Sweden have Epiphany (6 Jan) while Norway does not,

and only 2 kernels are necessary.
 
Points are the data corrected for the basic factors (weekday, country, store, and product, including the sinusoidal terms) and averaged over store and product. Lines are the model.

December and January are still easy because the holiday dates are fixed.
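
The "corrected" points in these plots are partial residuals: the target minus the fitted contribution of the basic features. A minimal sketch on synthetic, noiseless data, using `np.linalg.lstsq` as a stand-in for `LinearRegression`; all feature values and coefficients here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: 3 "basic" features and 2 sparse "holiday" flags
X_basic = rng.normal(size=(n, 3))
X_holiday = (rng.random((n, 2)) < 0.1).astype(float)
X = np.hstack([X_basic, X_holiday])

coef_true = np.array([0.5, -1.0, 2.0, 0.3, 0.7])
y = 1.0 + X @ coef_true  # noiseless, so the linear fit is exact

# Ordinary least squares with an explicit intercept column
A = np.hstack([np.ones((n, 1)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# Remove the basic part; what remains is the holiday contribution
partial = y - X[:, :3] @ coef[:3] - intercept
holiday_part = X[:, 3:] @ coef[3:]

print(np.allclose(partial, holiday_part))
```

With real data the two differ by the residual noise, which is why the plots show points scattered around the model lines.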

## May

May is more challenging because fixed-date holidays (May Day and Constitution Day) overlap with movable ones (Ascension Day, Pentecost, and Whit Monday).

In [None]:
plt.figure(figsize=(8, 8))
plt.suptitle('May in Norway')

for i in range(4):
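    # 30-day window around May; 365-day years are assumed, so it starts on
    # 1 May in 2015 and on 30 April in the later years
    may = slice(365 * i + 120, 365 * i + 150)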
    x = np.arange(may.start, may.stop)
    plt.subplot(4, 1, i + 1)
    year = 2015 + i
    if i == 3:
        plt.xlabel('time [day]')
    plt.ylabel(year)
    plt.plot(x, yy[may, 1], 'x')
    plt.plot(x, y_holiday[may, 1, 0])
    
    
    xx = np.linspace(may.start, may.stop - 1, 101)    
    y_sum = np.zeros(101)
        
    for h in holidays:
        dt = h['dt'][may]
        
        if h['country'][1] and np.min(np.abs(dt)) < 10:
            dt = np.linspace(dt[0], dt[-1], 101)
            y_gauss = gauss(dt, *popt)
            plt.plot(xx, y_gauss, color='gray')
            y_sum += y_gauss

plt.show()
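
To see how strongly the kernels overlap, one can count how many holidays are within their 10-day window on each day. This sketch uses the Norwegian 2015 dates from the holiday table above; the counting itself is an illustration, not part of the model code.

```python
import numpy as np

days = np.arange('2015-05-01', '2015-06-01', dtype='datetime64[D]')

# Norway, 2015: May Day, Ascension Day, Constitution Day, Pentecost, Whit Monday
holidays_norway_2015 = np.array(
    ['2015-05-01', '2015-05-14', '2015-05-17', '2015-05-24', '2015-05-25'],
    dtype='datetime64[D]')

# Signed days after each holiday, shape (n_days, n_holidays)
offsets = (days[:, None] - holidays_norway_2015[None, :]).astype(int)

# A kernel is active for 0 <= offset < 10
active = ((offsets >= 0) & (offsets < 10)).sum(axis=1)

for d, k in zip(days, active):
    print(d, k)
```

On 25 May 2015, for example, the windows of Constitution Day, Pentecost, and Whit Monday are all active at once.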

# Submit prediction

There is much room for improvement, but I leave it where I stopped. The main topic is my method for handling holidays.

In [None]:
X_test, _ = create_features(test)
y_test = model.predict(X_test)
df = add_gdp(test)

submit['num_sold'] = np.exp(y_test) * df['GDP']
submit.to_csv('submission.csv', index=False)
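
The last step inverts the target transform y = log(num_sold/GDP). A small round-trip check on made-up numbers (the `gdp` values are placeholders, not real data):

```python
import numpy as np

num_sold = np.array([120.0, 300.0, 45.0])
gdp = np.array([230.0, 385.0, 500.0])  # placeholder scale factors

y = np.log(num_sold / gdp)     # forward transform used for training
recovered = np.exp(y) * gdp    # inverse transform used for the submission

print(np.allclose(recovered, num_sold))
```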