#### What are you trying to do in this notebook?
For this challenge, we will be predicting a full year's worth of sales for three items at two stores located in three different countries. This dataset is completely fictional, but contains many effects you see in real-world data, e.g., weekend and holiday effects, seasonality, etc. The dataset is small enough to allow us to try numerous different modeling approaches.

#### Why are you trying it?
There are two (fictitious) independent store chains selling Kaggle merchandise that want to become the official outlet for all things Kaggle. We want to figure out which of the store chains (KaggleMart or KaggleRama) would have the best sales going forward.

**Files**
- train.csv - the training set, which includes the sales data for each date-country-store-item combination.
- test.csv - the test set; your task is to predict the corresponding item sales for each date-country-store-item combination. Note that the Public leaderboard is scored on the first quarter of the test year, and the Private leaderboard on the remaining three quarters.
- sample_submission.csv - a sample submission file in the correct format.
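
The Public/Private split described above can be mimicked on any full year of dates, which is handy when checking whether a validation score on the first quarter is representative of the rest of the year. The snippet below is only a sketch: it assumes the test year is 2019 and builds a hypothetical date range instead of reading test.csv.

```python
import pandas as pd

# Hypothetical stand-in for the test dates; the real ones come from test.csv.
# Assumption: the test set covers the calendar year 2019.
dates = pd.Series(pd.date_range('2019-01-01', '2019-12-31', freq='D'))

# The first quarter corresponds to the Public leaderboard, the rest to the Private one
is_public = dates.dt.quarter == 1
n_public = int(is_public.sum())
n_private = int((~is_public).sum())
print(f"public days: {n_public}, private days: {n_private}")
# public days: 90, private days: 275
```

The same mask can be applied to a validation year to report the score for the first quarter and for the remaining quarters separately.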

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import pickle
import itertools
import gc
import math
import matplotlib.pyplot as plt
import dateutil.easter as easter
from matplotlib.ticker import MaxNLocator, FormatStrFormatter, PercentFormatter
from datetime import datetime, date
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, HuberRegressor

In [None]:
original_train_df = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv')
original_test_df = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv')

# The dates are read as strings and must be converted
for df in [original_train_df, original_test_df]:
    df['date'] = pd.to_datetime(df.date)
    df.set_index('date', inplace=True, drop=False)
original_train_df.head(2)

In [None]:
def smape_loss(y_true, y_pred):
    """SMAPE Loss"""
    return np.abs(y_true - y_pred) / (y_true + np.abs(y_pred)) * 200

#print(smape_loss(np.array([1, 2]), np.array([3, 4]))) # should print [100, 66.6667]
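
One property of SMAPE that matters for the rest of this notebook is that it is not symmetric: for the same absolute error, predicting too low is penalized more than predicting too high. The small check below illustrates this with made-up numbers (they are not taken from the competition data) and repeats the metric so that it runs on its own.

```python
import numpy as np

def smape_loss(y_true, y_pred):
    """Per-sample SMAPE in percent (same formula as the cell above)."""
    return np.abs(y_true - y_pred) / (y_true + np.abs(y_pred)) * 200

y_true = np.array([100.0, 100.0])
y_pred = np.array([80.0, 120.0])   # same absolute error, opposite directions
under, over = smape_loss(y_true, y_pred)
print(f"under-prediction: {under:.2f}, over-prediction: {over:.2f}")
# under-prediction: 22.22, over-prediction: 18.18
```

This asymmetry is one plausible reason why scaling the predictions up slightly (a `LOSS_CORRECTION` a bit above 1) can lower the SMAPE.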

In [None]:
# Feature engineering
def engineer(df):
    """Return a new dataframe with the engineered features"""
    new_df = pd.DataFrame({'year': df.date.dt.year, # This feature makes it possible to fit an annual growth rate
                           'dayofyear': df.date.dt.dayofyear,
                           'wd4': df.date.dt.weekday == 4, # Friday
                           'wd56': df.date.dt.weekday >= 5, # Saturday and Sunday
                           'dec29': (df.date.dt.month == 12) & (df.date.dt.day == 29), # end-of-year peak
                           'dec30': (df.date.dt.month == 12) & (df.date.dt.day == 30),
                          })

    # Easter
    new_df['easter_week'] = False
    for year in range(2015, 2020):
        easter_date = easter.easter(year)
        easter_diff = df.date - np.datetime64(easter_date)
        new_df['easter_week'] = new_df['easter_week'] | (easter_diff > np.timedelta64(0, "D")) & (easter_diff < np.timedelta64(8, "D"))
    
    # Growth is country-specific
    #for country in ['Finland', 'Norway', 'Sweden']:
    #    new_df[f"{country}_year"] = (df.country == country) * df.date.dt.year
        
    # One-hot encoding (no need to encode the last categories)
    for country in ['Finland', 'Norway']:
        new_df[country] = df.country == country
    new_df['KaggleRama'] = df.store == 'KaggleRama'
    for product in ['Kaggle Mug', 'Kaggle Sticker']:
        new_df[product] = df['product'] == product
        
    # Seasonal variations (Fourier series)
    # The three products have different seasonal patterns
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 100): # 100
        new_df[f'sin{k}'] = np.sin(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'cos{k}'] = np.cos(dayofyear / 365 * 2 * math.pi * k)
        new_df[f'mug_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Mug']
        new_df[f'mug_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Mug']
        new_df[f'sticker_sin{k}'] = new_df[f'sin{k}'] * new_df['Kaggle Sticker']
        new_df[f'sticker_cos{k}'] = new_df[f'cos{k}'] * new_df['Kaggle Sticker']

    return new_df

train_df = engineer(original_train_df)
train_df['date'] = original_train_df.date
train_df['num_sold'] = original_train_df.num_sold.astype(np.float32)
test_df = engineer(original_test_df)
test_df['year'] = 2018 # no growth patch, see https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298318

features = test_df.columns

for df in [train_df, test_df]:
    df[features] = df[features].astype(np.float32)
print(list(features))
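
The Fourier columns in `engineer` are added one at a time inside a loop. As an alternative sketch (not the code used in this notebook), the same sine and cosine terms can be built in one step with `np.outer` and `pd.concat`. The helper name `fourier_features` and its parameters are hypothetical, and the column order differs from `engineer`, so this is not a drop-in replacement.

```python
import numpy as np
import pandas as pd

def fourier_features(dayofyear, n_harmonics, period=365.0):
    """Return sin/cos columns for harmonics 1..n_harmonics of a yearly cycle."""
    index = getattr(dayofyear, 'index', None)      # keep the index if a Series is passed
    k = np.arange(1, n_harmonics + 1)
    # angle[i, j] = 2*pi * dayofyear[i] * k[j] / period
    angle = 2 * np.pi * np.outer(np.asarray(dayofyear, dtype=float), k) / period
    sin = pd.DataFrame(np.sin(angle), index=index, columns=[f'sin{i}' for i in k])
    cos = pd.DataFrame(np.cos(angle), index=index, columns=[f'cos{i}' for i in k])
    return pd.concat([sin, cos], axis=1)

demo = fourier_features(pd.Series(np.arange(1, 366)), n_harmonics=3)
print(demo.shape)  # (365, 6)
```

Product-specific seasonality, as in the `mug_` and `sticker_` columns, would then come from multiplying this block row-wise by the corresponding one-hot column.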

In [None]:
#%%time
RUNS = 1 # should be 1. Increase the number of runs only if you want to see how the result depends on the random seed
OUTLIERS = True
TRAIN_VAL_CUT = datetime(2018, 1, 1)
LOSS_CORRECTION = 1 # correction factor between Huber loss and SMAPE: 1.035 (for linear regression with MSE use 1.038)

def fit_model(X_tr, X_va=None):
    """Scale the data, fit a model and, if validation data is given, validate the model and plot the predictions"""
    start_time = datetime.now()

    # Preprocess the data
    X_tr_f = X_tr[features]
    preproc = StandardScaler()
    X_tr_f = preproc.fit_transform(X_tr_f)
    y_tr = X_tr.num_sold.values.reshape(-1, 1)
    
    # Train the model
    #model = LinearRegression() # 5.80558
    model = HuberRegressor(epsilon=1.20) # 5.80143 (epsilon=1.20) ******************
    model.fit(X_tr_f, np.log(y_tr).ravel()) # 1-D target avoids scikit-learn's DataConversionWarning

    if X_va is not None:
        # Preprocess the validation data
        X_va_f = X_va[features]
        X_va_f = preproc.transform(X_va_f)
        y_va = X_va.num_sold.values.reshape(-1, 1)

        # Inference for validation
        y_va_pred = np.exp(model.predict(X_va_f)).reshape(-1, 1)
        
        # Evaluation: Execution time and SMAPE
        smape_before_correction = np.mean(smape_loss(y_va, y_va_pred))
        y_va_pred *= LOSS_CORRECTION
        smape = np.mean(smape_loss(y_va, y_va_pred))
        print(f"Fold {run}.{fold} | {str(datetime.now() - start_time)[-12:-7]}"
              f" | SMAPE: {smape:.5f}   (before correction: {smape_before_correction:.5f})")
        
        # Plot y_true vs. y_pred
        plt.figure(figsize=(10, 10))
        plt.scatter(y_va, y_va_pred, s=1, color='r')
        #plt.scatter(np.log(y_va), np.log(y_va_pred), s=1, color='g')
        plt.plot([plt.xlim()[0], plt.xlim()[1]], [plt.xlim()[0], plt.xlim()[1]], '--', color='k')
        plt.gca().set_aspect('equal')
        plt.xlabel('y_true')
        plt.ylabel('y_pred')
        plt.title('OOF Predictions')
        plt.show()

        # Show the outliers among the predictions
        if OUTLIERS:
            print("Outlier predictions - work on these to improve your score!")
            outliers = original_train_df.iloc[val_idx].copy()
            outliers['smape'] = smape_loss(y_va, y_va_pred)
            with pd.option_context("display.max_rows", 1000, "display.width", 160):
                print(outliers.sort_values('smape', ascending=False).head(120).sort_values('row_id'))
        
    return preproc, model

# Make the results reproducible
np.random.seed(202100)

total_start_time = datetime.now()
for run in range(RUNS):
    fold = 0
    train_idx = np.arange(len(train_df))[train_df.date < TRAIN_VAL_CUT]
    val_idx = np.arange(len(train_df))[train_df.date >= TRAIN_VAL_CUT] # >= so that the cut date itself is not dropped
    print(f"Fold {run}.{fold}")
    X_tr = train_df.iloc[train_idx]
    X_va = train_df.iloc[val_idx]
    
    preproc, model = fit_model(X_tr, X_va)

In [None]:
def plot_demo(country='Norway', store='KaggleMart', product='Kaggle Hat'):
    demo_df = pd.DataFrame({'row_id': 0,
                            'date': pd.date_range('2015-01-01', '2019-12-31', freq='D'),
                            'country': country,
                            'store': store,
                            'product': product})
    demo_df.set_index('date', inplace=True, drop=False)
    demo_df = engineer(demo_df)
    demo_df['num_sold'] = np.exp(model.predict(preproc.transform(demo_df[features]))) * LOSS_CORRECTION
    plt.figure(figsize=(18, 6))
    plt.plot(np.arange(len(demo_df)), demo_df.num_sold, label='prediction')
    train_subset = train_df[(original_train_df.country == country) & (original_train_df.store == store) & (original_train_df['product'] == product)]
    plt.plot(np.arange(len(train_subset)), train_subset.num_sold, label='true', alpha=0.5)
    plt.legend()
    plt.show()

plot_demo()

In [None]:
# Fit the model on the complete training data
train_idx = np.arange(len(train_df))
X_tr = train_df.iloc[train_idx]
preproc, model = fit_model(X_tr, None)

plot_demo()

# Inference for test
test_pred_list = []
test_pred_list.append(np.exp(model.predict(preproc.transform(test_df[features]))) * LOSS_CORRECTION)

if len(test_pred_list) > 0:
    # Create the submission file
    sub = original_test_df[['row_id']].copy()
    sub['num_sold'] = sum(test_pred_list) / len(test_pred_list)
    sub.to_csv('submission.csv', index=False)

    # Plot the distribution of the test predictions
    plt.figure(figsize=(16,3))
    plt.hist(train_df['num_sold'], bins=np.linspace(0, 3000, 201), density=True, label='Training')
    plt.hist(sub['num_sold'], bins=np.linspace(0, 3000, 201), density=True, rwidth=0.5, label='Test predictions')
    plt.xlabel('num_sold')
    plt.ylabel('Frequency')
    plt.legend()
    plt.show()

In [None]:
sub

#### Did it work?
This notebook goes together with the EDA notebook, which visualizes the various seasonal effects and the differences in growth rate.
Scikit-learn doesn't offer SMAPE as a loss function. As a workaround, I train with Huber loss on a transformed target and apply a correction factor to the predictions, and we'll see how far we get.

The transformed target for the regression is the log of the sales numbers.
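
One way to choose such a correction factor is a simple grid search on held-out predictions. The sketch below is an illustration under stated assumptions, not the procedure that produced the values quoted next to `LOSS_CORRECTION`: the helper names `mean_smape` and `best_correction` are hypothetical, and the data is synthetic.

```python
import numpy as np

def mean_smape(y_true, y_pred):
    """Mean SMAPE in percent."""
    return np.mean(np.abs(y_true - y_pred) / (y_true + np.abs(y_pred))) * 200

def best_correction(y_true, y_pred, factors=None):
    """Grid-search the multiplicative factor that minimises mean SMAPE."""
    if factors is None:
        factors = np.linspace(0.95, 1.10, 151)   # step of 0.001
    scores = [mean_smape(y_true, y_pred * f) for f in factors]
    return float(factors[int(np.argmin(scores))])

# Synthetic stand-in for validation data (not the competition data)
rng = np.random.default_rng(0)
y_true = rng.lognormal(mean=5.0, sigma=0.3, size=5000)
log_pred = np.log(y_true) + rng.normal(0.0, 0.1, size=5000)   # unbiased in log space
y_pred = np.exp(log_pred)

factor = best_correction(y_true, y_pred)
print(f"best factor: {factor:.3f}")
print(f"SMAPE before: {mean_smape(y_true, y_pred):.4f}, "
      f"after: {mean_smape(y_true, y_pred * factor):.4f}")
```

The factor should be estimated on validation predictions only, because tuning it against the leaderboard amounts to fitting the test set.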

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems while working on it. If anything I do in this notebook is unclear, please leave a comment.

#### What else do you think you can try as part of this approach?
Look at a notebook that presents feature engineering (based on the insights of the EDA) and a linear model that makes use of those features.

**I hope you find this notebook useful. Good luck!**