# 1 Introduction

This EDA explores the data available for the Tabular Playground Series - January 2022 competition. Simple data exploration is performed, as well as preliminary modeling.

In [None]:
!pip install calplot

In [None]:
import pandas as pd
import numpy as np
import gc

train = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-jan-2022/test.csv")

### 1.1 Column Breakdown

The first thing we should do is see what data we have and how many examples we have to learn from.

In [None]:
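def cat_column_info(column):
    """Print dtype, category count, values (if fewer than 10), NaN count, and max/min of a column for train and test."""
    num_categories = train[column].nunique()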
    print("------> {} <------".format(column))
    print("--: train - type {}".format(train[column].dtype))
    print("--: test  - type {}".format(test[column].dtype))
    print("--: train - # categories {}".format(train[column].nunique()))
    print("--: test  - # categories {}".format(test[column].nunique()))
    if num_categories < 10:
        if train[column].dtype == "int64":
            print("--: train - values {}".format(np.sort(train[column].unique())))
            print("--: test  - values {}".format(np.sort(test[column].unique())))
        else:
            print("--: train - values {}".format(train[column].unique()))
            print("--: test  - values {}".format(test[column].unique()))
    print("--: train - NaN count {}".format(train[column].isnull().values.sum()))
    print("--: test  - NaN count {}".format(test[column].isnull().values.sum()))
    print("--: train - max value {}".format(train[column].max()))
    print("--: test  - max value {}".format(test[column].max()))
    print("--: train - min value {}".format(train[column].min()))
    print("--: test  - min value {}".format(test[column].min()))
    print("")

print(": Train shape {}".format(train.shape))
print(": Test shape {}".format(test.shape))
print("")

In terms of samples:
    
* Training set contains 26,298 samples
* Testing set contains 6,570 samples

# 2 Features

### 2.1 Feature Overview

Let's take a quick look at the columns to see what we are dealing with.

In [None]:
features = ['date', 'country', 'store', 'product']

for feature in features:
    cat_column_info(feature)

There are 6 columns of data that we have to work with:

* `row_id` - a unique identifier for each row, with no repeated IDs.
* `date` - a date identifier in the form of `YYYY-MM-DD`.
* `country` - the country identifier - one of `Finland`, `Norway` or `Sweden`.
* `store` - the store identifier - one of `KaggleMart` or `KaggleRama`.
* `product` - the product identifier - one of `Kaggle Mug`, `Kaggle Hat`, or `Kaggle Sticker`.
* `num_sold` - the number of units of that product sold.

### 2.2 Null Values

Let's check for null values:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')
sns_params = {"palette": "bwr_r"}

# Count the number of null values that occur in each row
train["null_count"] = train.isnull().sum(axis=1)

# Group the null counts (count the grouping column itself, so rows with a missing num_sold are still counted)
counts = train.groupby("null_count")["null_count"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[5, 5])
colors = sns.color_palette("bwr_r")[0:5]
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Train Data)", fontsize=14)

In [None]:
# Count the number of null values that occur in each row
test["null_count"] = test.isnull().sum(axis=1)

# Group the null counts
counts = test.groupby("null_count")["null_count"].count().to_dict()
null_data = {"{} Null Value(s)".format(k) : v for k, v in counts.items() if k < 6}

# Plot the null count results
pie, ax = plt.subplots(figsize=[5, 5])
plt.pie(x=null_data.values(), autopct="%.2f%%", explode=[0.05]*len(null_data.keys()), labels=null_data.keys(), pctdistance=0.5, colors=colors)
_ = plt.title("Percentage of Null Values Per Row (Test Data)", fontsize=14)

### 2.3 `date` Feature

Let's take a closer look at the properties of the `date` feature.

In [None]:
print(": Training dates: {:,d} unique dates".format(train["date"].nunique()))
print(": Testing dates: {:,d} unique dates".format(test["date"].nunique()))

A few observations here that may be useful:

* The `date` ranges between the training set and the testing set are non-overlapping. The maximum `date` in the training set is _December 31, 2018_, while the minimum date in the testing set is _January 1, 2019_. This is important, as it means we are making predictions purely into the future - i.e. there is no interleaving between the training and testing dates.

* Each `date` appears 18 times. We can see this below:

    * Training: 26,298 samples / 1,461 unique dates = 18 samples per date
    * Testing: 6,570 samples / 365 unique dates = 18 samples per date


* The 18 rows per `date` correspond to every combination of `country`, `store`, and `product` (3 × 2 × 3 = 18) sold on each day. The combinations are:

| # | `country` | `store` | `product` |
| :-: | --------- | ------- | --------- |
| 1 | Finland   | KaggleMart | Kaggle Mug |
| 2 | Finland   | KaggleMart | Kaggle Hat |
| 3 | Finland   | KaggleMart | Kaggle Sticker |
| 4 | Finland   | KaggleRama | Kaggle Mug |
| 5 | Finland   | KaggleRama | Kaggle Hat |
| 6 | Finland   | KaggleRama | Kaggle Sticker |
| 7 | Norway | KaggleMart | Kaggle Mug |
| 8 | Norway | KaggleMart | Kaggle Hat |
| 9 | Norway | KaggleMart | Kaggle Sticker |
| 10 | Norway | KaggleRama | Kaggle Mug |
| 11 | Norway | KaggleRama | Kaggle Hat |
| 12 | Norway | KaggleRama | Kaggle Sticker |
| 13 | Sweden | KaggleMart | Kaggle Mug |
| 14 | Sweden | KaggleMart | Kaggle Hat |
| 15 | Sweden | KaggleMart | Kaggle Sticker |
| 16 | Sweden | KaggleRama | Kaggle Mug |
| 17 | Sweden | KaggleRama | Kaggle Hat |
| 18 | Sweden | KaggleRama | Kaggle Sticker |
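
The 18 rows per date can also be checked programmatically. The sketch below is illustrative only: `demo` is a small hypothetical stand-in for `train` (two days, one row per combination), built so the snippet runs on its own. With the real data, the same two checks would be applied to `train` and `test`.

```python
from itertools import product

import pandas as pd

countries = ["Finland", "Norway", "Sweden"]
stores = ["KaggleMart", "KaggleRama"]
products = ["Kaggle Mug", "Kaggle Hat", "Kaggle Sticker"]

# Hypothetical stand-in for `train`: two days, one row per combination
dates = pd.date_range("2015-01-01", periods=2, freq="D")
demo = pd.DataFrame(
    list(product(dates, countries, stores, products)),
    columns=["date", "country", "store", "product"],
)

# Check 1: number of rows per date
rows_per_date = demo.groupby("date").size()

# Check 2: number of distinct (country, store, product) combinations per date
unique_per_date = (
    demo.drop_duplicates(["date", "country", "store", "product"])
    .groupby("date")
    .size()
)

print(rows_per_date.unique(), unique_per_date.unique())
```

If both checks return 18 for every date, each combination appears exactly once per day.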

We should check whether the dates fully cover the period between the first and last dates, with no missing days.

In [None]:
train["date"] = pd.to_datetime(train["date"])
test["date"] = pd.to_datetime(test["date"])

In [None]:
import calplot

_ = calplot.calplot(
    train.groupby("date")["row_id"].count(), 
    vmin=0, 
    vmax=18, 
    colorbar=True, 
    suptitle="Training Set - Number of Samples Per Day",
    suptitle_kws=dict(fontsize=20),
)

In [None]:
_ = calplot.calplot(
    test.groupby("date")["row_id"].count(), 
    vmin=0, 
    vmax=18, 
    colorbar=True, 
    suptitle="Testing Set - Number of Samples Per Day",
    suptitle_kws=dict(fontsize=20),
)

It appears that we have date data for every single day of every year covered contiguously for both the training and testing datasets.
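
The calendar plots give a visual check; the same thing can be confirmed numerically by comparing the observed dates against a full daily range. A minimal sketch, assuming a hypothetical helper `missing_dates` (not part of the original notebook) and a toy list of dates with one deliberate gap:

```python
import pandas as pd


def missing_dates(dates):
    """Return the calendar days between the min and max of `dates` that never occur."""
    observed = pd.DatetimeIndex(pd.to_datetime(pd.Series(dates)).unique())
    expected = pd.date_range(observed.min(), observed.max(), freq="D")
    return expected.difference(observed)


# Toy example with 2015-01-03 deliberately left out
demo_dates = ["2015-01-01", "2015-01-02", "2015-01-04", "2015-01-05"]
gaps = missing_dates(demo_dates)
print(list(gaps.strftime("%Y-%m-%d")))

# A gap-free 2015-2018 daily range, for comparison with the unique date count
n_days_2015_2018 = len(pd.date_range("2015-01-01", "2018-12-31", freq="D"))
print(n_days_2015_2018)
```

With the real data, `missing_dates(train["date"])` returning an empty index would confirm the observation above.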

### 2.4 Examining Sales by Date, Product, Store and Country

Let's see if there are any trends in our sales figures that are visible to a simple visual inspection.

In [None]:
for country in ["Finland", "Norway", "Sweden"]:
    # Total units sold per day in this country (a Series indexed by date)
    df = train[train["country"] == country].groupby("date")["num_sold"].sum()

    _ = calplot.calplot(
        df, 
        colorbar=True, 
        linewidth=0,
        edgecolor="black",
        linecolor="w",
        suptitle="Total Products Sold per Day in {}".format(country),
        suptitle_kws=dict(fontsize=20),
    )

A few observations:

* It appears that sales are influenced by the following holidays:
    * Easter (various clusters in April)
    * Father's Day (various clusters in November)
    * National Day of Sweden (clusters in June for Sweden)
    * Mother's Day in Sweden (clusters in May for Sweden)
* Sales around the end of the year are high (Christmas time)
* Sales on weekends appear to be higher than during weekdays
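
The weekend effect can be quantified with a simple group-by on the day of the week. This is a sketch under stated assumptions: `demo` is a hypothetical two-week frame with made-up sales figures, and `day_type` is an illustrative column name.

```python
import pandas as pd

# Hypothetical data: two weeks starting on Monday 2015-01-05, made-up sales
demo = pd.DataFrame({
    "date": pd.date_range("2015-01-05", periods=14, freq="D"),
    "num_sold": [100] * 5 + [150] * 2 + [110] * 5 + [160] * 2,
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
demo["day_type"] = (demo["date"].dt.dayofweek >= 5).map(
    {True: "weekend", False: "weekday"}
)

day_type_means = demo.groupby("day_type")["num_sold"].mean()
print(day_type_means)
```

Applied to `train`, the same grouping could be extended with `country`, `store`, or `product` to see whether the gap differs between them.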

We should dig into this a little more.

### 2.5 Holidays vs Sales

Let's see if we can highlight these a little better with a different visualization. Below, we'll mark the following holidays:

* _Red dashed_ lines indicate Easter
* _Green dashed_ lines indicate Christmas
* _Purple dashed_ lines indicate Whit Sunday
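
Easter and Whit Sunday move from year to year, so their dates can be computed rather than typed in by hand. A minimal sketch, assuming the anonymous Gregorian (Meeus/Jones/Butcher) algorithm for Easter Sunday and Whit Sunday falling 49 days after it; `easter_sunday` and `holiday_markers` are illustrative names, not part of the original notebook:

```python
from datetime import date, timedelta


def easter_sunday(year):
    """Date of Easter Sunday in the Gregorian calendar (Meeus/Jones/Butcher)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# (date, color) pairs for the years covered by the training set
holiday_markers = []
for year in range(2015, 2019):
    easter = easter_sunday(year)
    holiday_markers.append((easter, "red"))                          # Easter
    holiday_markers.append((easter + timedelta(days=49), "purple"))  # Whit Sunday
    holiday_markers.append((date(year, 12, 25), "green"))            # Christmas

for marker_date, color in holiday_markers:
    print(marker_date, color)
```

Each `(date, color)` pair could then be passed to `plt.axvline` in a loop.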

In [None]:
import datetime
from matplotlib.lines import Line2D

fig, ax = plt.subplots(figsize=[20, 10])
sweden_sales = pd.DataFrame(train[(train["country"] == "Sweden")])
sweden_sales["date1"] = sweden_sales["date"]
sweden_sales.set_index(sweden_sales["date"], inplace=True)
sweden_sales = sweden_sales.groupby("date1")["num_sold"].sum()
ax = sns.lineplot(data=sweden_sales)
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 6500, color="green", linestyle="--")
_ = ax.set_title("Total Number of Sales for All Sweden Kaggle Stores vs Holidays", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
sale_line = Line2D([], [], color=sns.color_palette()[0], label="All Sales, All Stores")
_ = plt.legend(handles=[easter_line, christmas_line, whit_sunday_line, sale_line])

As we can see here, in Sweden, sales bumps occur for:
    
* Easter
* Christmas - followed by a similar bump on New Year's Day
* Whit Sunday - although only in 2016 and 2017

Let's see if these extend to Finland:

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
finland_sales = pd.DataFrame(train[(train["country"] == "Finland")])
finland_sales["date1"] = finland_sales["date"]
finland_sales.set_index(finland_sales["date"], inplace=True)
finland_sales = finland_sales.groupby("date1")["num_sold"].sum()
ax = sns.lineplot(data=finland_sales)
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 6500, color="green", linestyle="--")
_ = ax.set_title("Total Number of Sales for All Finland Kaggle Stores vs Holidays", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
sale_line = Line2D([], [], color=sns.color_palette()[0], label="All Sales, All Stores")
_ = plt.legend(handles=[easter_line, christmas_line, whit_sunday_line, sale_line])

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
norway_sales = pd.DataFrame(train[(train["country"] == "Norway")])
norway_sales["date1"] = norway_sales["date"]
norway_sales.set_index(norway_sales["date"], inplace=True)
norway_sales = norway_sales.groupby("date1")["num_sold"].sum()
ax = sns.lineplot(data=norway_sales)
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 6500, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 6500, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 6500, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 6500, color="green", linestyle="--")
_ = ax.set_title("Total Number of Sales for All Norway Kaggle Stores vs Holidays", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
sale_line = Line2D([], [], color=sns.color_palette()[0], label="All Sales, All Stores")
_ = plt.legend(handles=[easter_line, christmas_line, whit_sunday_line, sale_line])

As we can see, the same subset of holidays carries forward across all three countries. We can draw a few conclusions:

* Creating features from holidays will likely result in better predictions.
* Given this is time-series data, providing the regressor with lag data for each of these holiday features may help in the prediction process.
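
One possible way to build such features is sketched below. It is not the notebook's own feature pipeline: the `holidays` calendar contains a single hypothetical entry, `demo` is a toy stand-in for `train`, and the `holiday_lag_k` / `holiday_lead_k` column names are illustrative. The flags are computed from the date itself rather than with a row-wise `shift`, so they stay correct when each date appears in many rows.

```python
import pandas as pd

# Hypothetical holiday calendar with a single entry, for illustration only
holidays = pd.DatetimeIndex(["2015-12-25"])

# Toy frame standing in for `train`; only the `date` column matters here
demo = pd.DataFrame({"date": pd.date_range("2015-12-23", periods=5, freq="D")})

# 1 on the holiday itself, 0 otherwise
demo["is_holiday"] = demo["date"].isin(holidays).astype(int)

for k in (1, 2):
    # 1 if a holiday fell k days before this date (the days after the holiday)
    demo[f"holiday_lag_{k}"] = (
        (demo["date"] - pd.Timedelta(days=k)).isin(holidays).astype(int)
    )
    # 1 if a holiday falls k days after this date (the run-up to the holiday)
    demo[f"holiday_lead_{k}"] = (
        (demo["date"] + pd.Timedelta(days=k)).isin(holidays).astype(int)
    )

print(demo)
```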

Let's break out the sales by different product types and store types next to see if there are any patterns that are product and store specific.

#### 2.5.1 Norway Sales Data

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
sales = pd.DataFrame(train[(train["country"] == "Norway") & (train["store"] == "KaggleMart")])
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product")
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 1250, color="green", linestyle="--")
_ = ax.set_title("Product Sales for Norway KaggleMart by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
sales = pd.DataFrame(train[(train["country"] == "Norway") & (train["store"] == "KaggleRama")])
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product")
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 1250, color="green", linestyle="--")
_ = ax.set_title("Product Sales for Norway KaggleRama by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In general, for both `KaggleMart` and `KaggleRama`, we see the following trends:

* Sales for `Kaggle Hat` increase in the first quarter, decrease for the second and third quarters, and then increase in the fourth quarter.
* Sales for `Kaggle Mug` decrease in the first and second quarters, and then slowly increase in the third and fourth quarters.
* Sales for `Kaggle Sticker` remain roughly flat from quarter to quarter.
* Sales for all three products spike during Christmas, Easter, and Whit Sunday.
* The `Kaggle Hat` appears to have the greatest fluctuation in sales numbers day to day, followed by `Kaggle Mug`, and finally `Kaggle Sticker`.
* `KaggleRama` generally outperforms `KaggleMart`.
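
The quarterly patterns above are read off the line plots; a per-quarter average makes them easier to compare. A sketch under stated assumptions: `demo` is a tiny hypothetical frame with made-up numbers standing in for one store's rows of `train`, so the values printed are not results from the competition data.

```python
import pandas as pd

# Hypothetical rows for a single store; the sales figures are made up
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-15", "2015-02-15", "2015-04-15", "2015-07-15", "2015-10-15",
        "2015-01-15", "2015-04-15",
    ]),
    "product": ["Kaggle Hat"] * 5 + ["Kaggle Sticker"] * 2,
    "num_sold": [300, 340, 250, 200, 280, 90, 92],
})

# Calendar quarter (1-4) of each row
demo["quarter"] = demo["date"].dt.quarter

# Mean daily sales per quarter (rows) and product (columns)
quarterly = demo.pivot_table(
    index="quarter", columns="product", values="num_sold", aggfunc="mean"
)
print(quarterly)
```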

#### 2.5.2 Sweden Sales Data

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
sales = pd.DataFrame(train[(train["country"] == "Sweden") & (train["store"] == "KaggleMart")])
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product")
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 1250, color="green", linestyle="--")
_ = ax.set_title("Product Sales for Sweden KaggleMart by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
sales = pd.DataFrame(train[(train["country"] == "Sweden") & (train["store"] == "KaggleRama")])
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product")
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 1250, color="green", linestyle="--")
_ = ax.set_title("Product Sales for Sweden KaggleRama by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In general, for both `KaggleMart` and `KaggleRama`, we see the following trends:

* Sales for `Kaggle Hat` increase in the first quarter, decrease for the second and third quarters, and then increase in the fourth quarter.
* Sales for `Kaggle Mug` decrease in the first and second quarters, and then slowly increase in the third and fourth quarters.
* Sales for `Kaggle Sticker` remain flat based on quarter.
* Sales for all three products spike during Christmas, Easter, and Whit Sunday.
* The `Kaggle Hat` appears to have the greatest fluctuation in sales numbers day to day, followed by `Kaggle Mug`, and finally `Kaggle Sticker`.
* `KaggleMart` generally outperforms `KaggleRama`.
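
Which store sells more is easier to confirm from totals than from separate line plots. The sketch below shows one way to compare the two stores per country; `demo` and its numbers are hypothetical and must not be read as results, and `rama_to_mart` is an illustrative column name.

```python
import pandas as pd

# Hypothetical aggregated rows; the sales figures are made up
demo = pd.DataFrame({
    "country": ["Norway", "Norway", "Sweden", "Sweden"],
    "store": ["KaggleMart", "KaggleRama", "KaggleMart", "KaggleRama"],
    "num_sold": [100, 150, 200, 500],
})

# Total units per country (rows) and store (columns)
totals = demo.groupby(["country", "store"])["num_sold"].sum().unstack("store")

# A ratio above 1 means KaggleRama sold more than KaggleMart in that country
totals["rama_to_mart"] = totals["KaggleRama"] / totals["KaggleMart"]
print(totals)
```

Running the same aggregation on `train` for each country would show directly whether the store ranking is the same everywhere.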

#### 2.5.3 Finland Sales Data

In [None]:
fig, ax = plt.subplots(figsize=[20, 10])
sales = pd.DataFrame(train[(train["country"] == "Finland") & (train["store"] == "KaggleMart")])
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product")
ax.grid(False)
_ = plt.axvline(datetime.date(2015, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2016, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2017, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 1, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 7, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2018, 10, 1), 0, 1250, color="gray", linestyle=":")
_ = plt.axvline(datetime.date(2015, 4, 5), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2015, 5, 24), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2015, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2016, 3, 27), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2016, 5, 15), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2016, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 12, 25), 0, 1250, color="green", linestyle="--")
_ = plt.axvline(datetime.date(2017, 4, 16), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2017, 6, 4), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 4, 1), 0, 1250, color="red", linestyle="--")
_ = plt.axvline(datetime.date(2018, 5, 20), 0, 1250, color="purple", linestyle="--")
_ = plt.axvline(datetime.date(2018, 12, 25), 0, 1250, color="green", linestyle="--")
_ = ax.set_title("Product Sales for Finland KaggleMart by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In [None]:
# plt.subplots returns a (figure, axes) tuple, so unpack it and draw on that axes.
fig, ax = plt.subplots(figsize=(20, 10))
sales = train[(train["country"] == "Finland") & (train["store"] == "KaggleRama")]
ax = sns.lineplot(data=sales, x="date", y="num_sold", hue="product", ax=ax)
ax.grid(False)
# Dotted gray lines mark the start of each quarter. The ymin/ymax arguments of
# axvline are axes fractions (0 to 1), so the defaults already span the full height.
for year in range(2015, 2019):
    for month in (1, 4, 7, 10):
        _ = plt.axvline(datetime.date(year, month, 1), color="gray", linestyle=":")
# Dashed lines mark Easter (red), Whit Sunday (purple), and Christmas (green).
holiday_dates = {
    "red": [(2015, 4, 5), (2016, 3, 27), (2017, 4, 16), (2018, 4, 1)],
    "purple": [(2015, 5, 24), (2016, 5, 15), (2017, 6, 4), (2018, 5, 20)],
    "green": [(2015, 12, 25), (2016, 12, 25), (2017, 12, 25), (2018, 12, 25)],
}
for color, dates in holiday_dates.items():
    for year, month, day in dates:
        _ = plt.axvline(datetime.date(year, month, day), color=color, linestyle="--")
_ = ax.set_title("Product Sales for Finland KaggleRama by Product", fontsize=15)
_ = ax.set_ylabel("Number Sold", fontsize=15)
_ = ax.set_xlabel("Date", fontsize=15)
legend_handles, legend_labels = plt.gca().get_legend_handles_labels()
easter_line = Line2D([], [], color="red", linestyle="--", label="Easter")
christmas_line = Line2D([], [], color="green", linestyle="--", label="Christmas")
whit_sunday_line = Line2D([], [], color="purple", linestyle="--", label="Whit Sunday")
quarter_line = Line2D([], [], color="gray", linestyle=":", label="Quarter End / Start")
legend_handles.extend([easter_line, christmas_line, whit_sunday_line, quarter_line])
_ = plt.legend(handles=legend_handles)

In general, for both `KaggleMart` and `KaggleRama`, we see the following trends:

* Sales for `Kaggle Hat` increase in the first quarter, decrease for the second and third quarters, and then increase in the fourth quarter.
* Sales for `Kaggle Mug` decrease in the first and second quarters, and then slowly increase in the third and fourth quarters.
* Sales for `Kaggle Sticker` stay roughly flat from quarter to quarter.
* Sales for all three products spike during Christmas, Easter, and Whit Sunday.
* The `Kaggle Hat` appears to have the greatest fluctuation in sales numbers day to day, followed by `Kaggle Mug`, and finally `Kaggle Sticker`.
* `KaggleRama` generally outperforms `KaggleMart`.
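
The trends above are read off the plots by eye. A minimal sketch of how they could be checked numerically is shown below, assuming `train` has the `date`, `country`, `store`, `product`, and `num_sold` columns used in the plotting cells. The helper names `quarterly_mean_sales` and `daily_change_std` are illustrative and not part of the original notebook.

```python
import pandas as pd


def quarterly_mean_sales(df, country, store):
    """Mean daily num_sold per quarter (rows) and product (columns)."""
    subset = df[(df["country"] == country) & (df["store"] == store)].copy()
    subset["quarter"] = pd.to_datetime(subset["date"]).dt.quarter
    return subset.pivot_table(
        index="quarter", columns="product", values="num_sold", aggfunc="mean"
    )


def daily_change_std(df, country, store):
    """Standard deviation of the day-to-day change in num_sold, per product."""
    subset = df[(df["country"] == country) & (df["store"] == store)].copy()
    subset["date"] = pd.to_datetime(subset["date"])
    subset = subset.sort_values("date")
    # diff() gives the change from the previous day within each product.
    return subset.groupby("product")["num_sold"].apply(lambda s: s.diff().std())
```

Calling `quarterly_mean_sales(train, "Finland", "KaggleRama")` gives one row per quarter, which makes the rise and fall of each product easier to compare than the line plot. A larger value from `daily_change_std` would correspond to the greater day-to-day fluctuation noted for `Kaggle Hat`.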

## 2.6 Overall Observations

Some observations about the data so far that may help us engineer features:

* Including the holidays of Christmas, Easter, and Whit Sunday as individual features may help boost performance.
* Marking weekends and weekdays as features may help boost performance.
* Including the quarter number as a feature may help boost performance.
* Because this is time series data, lagged versions of the features above may also help boost performance. Finding the best lag value will likely take some experimentation.
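
The feature ideas above could be sketched roughly as follows. This is only one possible approach, not a tuned implementation: the helper `add_calendar_features`, the `HOLIDAYS` table, and the generated column names are all hypothetical. The holiday dates are the ones marked on the plots (2015 to 2018), and a "lag" here is assumed to mean flagging the day that falls `k` days after a holiday.

```python
import pandas as pd

# Holiday dates as marked on the plots above (2015-2018 only).
HOLIDAYS = {
    "easter": ["2015-04-05", "2016-03-27", "2017-04-16", "2018-04-01"],
    "whit_sunday": ["2015-05-24", "2016-05-15", "2017-06-04", "2018-05-20"],
    "christmas": ["2015-12-25", "2016-12-25", "2017-12-25", "2018-12-25"],
}


def add_calendar_features(df, lags=(1, 2)):
    """Return a copy of df with quarter, weekend, holiday, and lagged holiday flags."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    out["quarter"] = dates.dt.quarter
    # dayofweek: Monday is 0, so Saturday and Sunday are 5 and 6.
    out["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
    for name, days in HOLIDAYS.items():
        holiday_dates = pd.to_datetime(days)
        out["is_" + name] = dates.isin(holiday_dates).astype(int)
        # A lag of k flags the day that falls k days after the holiday.
        for k in lags:
            shifted = holiday_dates + pd.Timedelta(days=k)
            out["is_{}_lag{}".format(name, k)] = dates.isin(shifted).astype(int)
    return out
```

For example, `train_features = add_calendar_features(train)` would add the new columns without modifying `train`. Any dates outside 2015 to 2018 would need their holidays added to `HOLIDAYS` first, and the `lags` tuple is the knob to experiment with.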

Other observations:

* `KaggleRama` outperforms `KaggleMart` in `Finland` and `Norway`, but not in `Sweden`.
* In every location and in each store type, `Kaggle Hat` is the best seller, followed by `Kaggle Mug` and then `Kaggle Sticker`. 
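
These two observations can be double-checked with a couple of group-by summaries. The sketch below assumes the same `train` columns as before. The helper names `store_ratio_by_country` and `product_totals` are illustrative only.

```python
import pandas as pd


def store_ratio_by_country(df):
    """Total KaggleRama sales divided by total KaggleMart sales, per country."""
    totals = df.groupby(["country", "store"])["num_sold"].sum().unstack("store")
    # A ratio above 1 means KaggleRama sold more than KaggleMart in that country.
    return totals["KaggleRama"] / totals["KaggleMart"]


def product_totals(df):
    """Total num_sold per (country, store) row, with one column per product."""
    totals = df.groupby(["country", "store", "product"])["num_sold"].sum()
    return totals.unstack("product")
```

Running `store_ratio_by_country(train)` shows at a glance which store leads in each country, and `product_totals(train)` lets the product ordering be compared row by row.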

# More to come...