### **Kaggle sales** 
###### **(If you already know the data, skip this part and move on to the introduction.)**<br />
There are two (fictitious) independent store chains selling Kaggle merchandise that want to become the official outlet for all things Kaggle. They've decided to see if the Kaggle community could help them figure out which of the store chains would have the best sales going forward. So they've collected some data and are asking us to build forecasting models to help them decide.

They want us to help with figuring out whether KaggleMart or KaggleRama should become the official Kaggle outlet!

Understanding the data might help us see the problem more clearly. The data is well managed (nothing to complain about) and cleanly captured (thanks to Kaggle). This type of data is helpful for practising time series analysis, especially for beginners who want to push their level. Let's get back to the data.

##### **Features**
* **row_id** - Unique **id** for each row
* **date** - Date of the sale **(Year-Month-Day)**
* **country** - Country where the sales are recorded, one of three unique values **[Finland, Norway, Sweden]**
* **store** - Store chain that made the sale **[KaggleMart, KaggleRama]**
* **product** - Product sold **[Kaggle Mug, Kaggle Hat, Kaggle Sticker]**
* **num_sold** - Number of units of the given product sold by that store in that country on that **particular day**

### **Introduction**
This notebook focuses only on EDA of this particular data, although it is not quite the same as other EDA notebooks. The plots used here aim to be better visualized and explained. Just a simple line plot with a few improvements may help us see the data more clearly than before.

##### **Table of contents**
* Periodic total sales of Kaggle products *(summing over all countries, stores and products)*
    * Reviewing seasonality
* Periodic sales of Kaggle products by country *(averaging all stores and products)*
    * Country influence
    * Semi-annual view of each country's sales
    

### **Total Sales**
The sales are spread across three countries, two stores and three products. Looking at the data more closely, we can see that there are **18 entries for each date**. These repeated dates *(not true duplicates)* come from the combinations of **3 countries, 2 stores and 3 products (3 × 2 × 3 = 18 entries)**. The figure below shows the sum of these 18 entries per date, i.e. the total sales of Kaggle products.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MonthLocator
import seaborn as sns
import matplotlib.gridspec as gridspec
from statsmodels.tsa.seasonal import seasonal_decompose

import warnings
warnings.simplefilter('ignore')

from datetime import datetime

dataset = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv")
total_sales = dataset.loc[:, ["date", "num_sold"]].groupby("date").sum()
total_sales.index = pd.to_datetime(total_sales.index)

fig, ax = plt.subplots(figsize=(20, 3))
ax.plot(total_sales, label="Total Sales", c="#2B2D42", lw=1)
ax.margins(x=0)
ax.legend(loc=2, edgecolor="#FFF")
for label in ax.xaxis.get_ticklabels():
    label.set_horizontalalignment('left')
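
As a quick sanity check on the 18-entries-per-date count, the sketch below simply enumerates the country, store and product values listed in the features section. It does not touch the real file; the variable names are only for illustration.

```python
from itertools import product

# Category values as listed in the features section.
countries = ["Finland", "Norway", "Sweden"]
stores = ["KaggleMart", "KaggleRama"]
products = ["Kaggle Mug", "Kaggle Hat", "Kaggle Sticker"]

# Every (country, store, product) triple that can appear on a single date.
combinations = list(product(countries, stores, products))
print(len(combinations))  # 18
```

On the real data, something like `dataset.groupby("date").size()` should report 18 for every date, assuming no rows are missing.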

##### **Reviewing seasonality**
We don't need a detailed analysis to spot the seasonality; it is directly visible. Still, let's dig further into the plot. From the section above we can say that the biggest sales occur around the turn of the year (late December into early January). The **highest sale** is on **29th December 2018**, and we might assume that next year's peak will be even higher.
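
The date of a peak can be read off a date-indexed series with `idxmax`. The sketch below uses a tiny made-up series (the dates and values are invented for illustration, not taken from the competition data) just to show the call.

```python
import pandas as pd

# Made-up daily totals, only to demonstrate the lookup.
toy_sales = pd.Series(
    [120, 340, 560, 410],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    name="num_sold",
)

peak_date = toy_sales.idxmax()   # index label (a Timestamp) of the largest value
peak_value = toy_sales.max()     # the largest value itself
print(peak_date.date(), peak_value)
```

In this notebook the same idea would presumably be `total_sales["num_sold"].idxmax()`, since `total_sales` is indexed by date.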

Seasonality is the most important aspect of this data. We can spot several seasonal patterns just by looking at the plot. The bigger problem is deciding whether the data shows additive or multiplicative seasonality. On one hand, the seasonal swings vary little over a long period, which would suggest additive seasonality; on the other hand, they could be growing slightly every year, which would suggest multiplicative seasonality. Hence we are left in some doubt, but we have to choose one here.

I'm going to go with **multiplicative seasonality**, as we may well see more sales in future years.
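
To make the additive versus multiplicative distinction concrete, here is a minimal sketch on synthetic data (the trend and seasonal shapes are assumptions chosen for illustration, not fitted to the sales). In the additive series the yearly swing stays the same size; in the multiplicative one it grows with the level.

```python
import numpy as np

days = np.arange(4 * 365)                    # four synthetic years of daily data
trend = 100 + 0.05 * days                    # slowly rising level
season = np.sin(2 * np.pi * days / 365)      # one cycle per 365-day year

additive = trend + 20 * season               # swing size is constant
multiplicative = trend * (1 + 0.2 * season)  # swing size scales with the level

def yearly_swing(series):
    """Peak-to-trough range of the detrended series within each 365-day year."""
    detrended = series - trend
    years = [detrended[y * 365:(y + 1) * 365] for y in range(4)]
    return [float(year.max() - year.min()) for year in years]

add_swing = yearly_swing(additive)
mul_swing = yearly_swing(multiplicative)
print(add_swing)
print(mul_swing)
```

If the yearly swings in the real plot look roughly constant, additive is the safer reading; if they grow as sales grow, multiplicative fits better.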

In [None]:
decompose_result_mult = seasonal_decompose(total_sales, model="multiplicative")

trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
# trend.plot(figsize=(20, 3))
# seasonal.plot(figsize=(20, 3), c="#2B2D42")

def plot_periodogram(ts, detrend='linear', ax=None):
    from scipy.signal import periodogram
    fs = 365.2425  # samples (days) per year; recent pandas versions reject pd.Timedelta("1Y")
    freqencies, spectrum = periodogram(
        ts,
        fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots(figsize = (20, 3))
#     print(freqencies)
#     print(spectrum)
    ax.step(freqencies, spectrum, color="#2B2D42")
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels(
        [
            "A",
            "SA",
            "Q",
            "BM",
            "M",
            "BW",
            "W",
            "SW",
        ],
        rotation=0,
    )
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Periodogram")
    ax.text(max(freqencies)-150, max(spectrum), 
            """
            A  - Annual (1)
            SA - Semi Annual (2)
            Q  - Quarterly (4)
            BM - Bi Monthly (6)
            M  - Monthly (12)
            BW - Bi Weekly (26)
            W  - Weekly (52)
            SW - Semi Weekly (104)
            """, ha="left", va="top")
    return ax



fig = plt.figure(tight_layout=True, figsize=(20, 5))
gs = gridspec.GridSpec(2, 2)

ax1 = fig.add_subplot(gs[0, 0])
ax2 = fig.add_subplot(gs[1, 0])
ax3 = fig.add_subplot(gs[0:, 1])
ax1.plot(trend, c="#2B2D42", label="Trend")
ax2.plot(seasonal, c="#2B2D42", label="Seasonal")
for i in [ax1, ax2, ax3]:
    i.margins(x=0)
    i.legend(loc=2, edgecolor="#FFF")
plot_periodogram(total_sales.num_sold, ax=ax3);

plt.show()
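
A note on reading the periodogram's x axis: with `fs` set to the number of days per year, frequencies come out in cycles per year, so a weekly pattern should show up near the `W` (52) tick. The sketch below checks this on a made-up series with a pure 7-day cycle. It uses NumPy's FFT as a simplified stand-in for `scipy.signal.periodogram` (no detrending, window or scaling options).

```python
import numpy as np

fs = 365.2425                  # samples per year for daily data
n = 208 * 7                    # 208 whole weeks of synthetic daily observations
t = np.arange(n)
weekly = 10 * np.sin(2 * np.pi * t / 7)   # toy series with a pure 7-day cycle

# Raw power spectrum of the mean-removed series.
power = np.abs(np.fft.rfft(weekly - weekly.mean())) ** 2
freqs = np.fft.rfftfreq(n, d=1 / fs)      # frequencies in cycles per year

peak_freq = float(freqs[np.argmax(power)])
print(round(peak_freq, 2))
```

The peak lands at about 52 cycles per year, which is why a spike at the `W` tick in the plot above points to a weekly pattern.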

### **Total Sales through locations**
Three locations are involved in the data **[Finland, Norway, Sweden]**. Each location contributes its share of the sales. In detail, the visualization below shows that **Norway is a better seller** than the other two locations, while **Finland seems to be the lowest** of them all. These differences are useful for recognizing where sales efforts could be focused.

The visualization below also includes a **semi-annual view** of all locations [October 2017 - March 2018], which zooms in on a single peak.

In [None]:
finland_dataset = dataset.loc[dataset["country"]=="Finland", :]
norway_dataset = dataset.loc[dataset["country"]=="Norway", :]
sweden_dataset = dataset.loc[dataset["country"]=="Sweden", :]

average_sales_finland = finland_dataset.loc[:, ["date", "num_sold"]].groupby("date").mean()
average_sales_finland.index = pd.to_datetime(average_sales_finland.index)
# average_monthly_sales_finland = average_sales_finland.resample('M').mean()
# average_monthly_sales_finland.index = pd.to_datetime(average_monthly_sales_finland.index.strftime('%Y-%m'))

average_sales_norway = norway_dataset.loc[:, ["date", "num_sold"]].groupby("date").mean()
average_sales_norway.index = pd.to_datetime(average_sales_norway.index)
# average_monthly_sales_norway = average_sales_norway.resample('M').mean()
# average_monthly_sales_norway.index = pd.to_datetime(average_monthly_sales_norway.index.strftime('%Y-%m'))

average_sales_sweden = sweden_dataset.loc[:, ["date", "num_sold"]].groupby("date").mean()
average_sales_sweden.index = pd.to_datetime(average_sales_sweden.index)
# average_monthly_sales_sweden = average_sales_sweden.resample('M').mean()
# average_monthly_sales_sweden.index = pd.to_datetime(average_monthly_sales_sweden.index.strftime('%Y-%m'))

average_sales_finland_sectioned = average_sales_finland["2017-10-01":"2018-04-01"]
average_sales_finland_sectioned.index = pd.to_datetime(average_sales_finland_sectioned.index)

average_sales_norway_sectioned = average_sales_norway["2017-10-01":"2018-04-01"]
average_sales_norway_sectioned.index = pd.to_datetime(average_sales_norway_sectioned.index)

average_sales_sweden_sectioned = average_sales_sweden["2017-10-01":"2018-04-01"]
average_sales_sweden_sectioned.index = pd.to_datetime(average_sales_sweden_sectioned.index)

fig, ax = plt.subplots(3, 2, figsize=(20, 9), gridspec_kw={'width_ratios': [4, 1]})
fig.subplots_adjust(wspace = 0.02, hspace= 0.2)
datemin = datetime(2017, 10, 1)
datemax = datetime(2018, 3, 1) 
for i in range(3):
    for j in range(2):
        ax[i][j].set_ylim(100, 1600)
        ax[i][j].set_yticks([100, 600, 1100, 1600])
#         if i!=2 and j!=1:
#         ax[i][j].set_xticks([])
        ax[i][j].margins(x=0)
        if j==1:
            ax[i][j].set_yticks([])
        for k in ['right', 'top']:
            ax[i][j].spines[k].set_visible(False)
        if j==1:
            ax[i][j].set_xticks([])
            ax[i][j].spines["left"].set_visible(False)
            ax[i][j].spines["bottom"].set_visible(False)
        for label in ax[i][j].xaxis.get_ticklabels():
            label.set_horizontalalignment('left')
            
ax[0][0].axvspan(pd.Timestamp('2017-10-01'), pd.Timestamp('2018-04-01'), facecolor='#707070', alpha=0.2)
ax[0][0].plot(average_sales_finland[:"2017-10-01"], label="Finland", c="#2B2D42", lw=1)
ax[0][0].plot(average_sales_finland["2018-04-01":], label="Finland", c="#2B2D42", lw=1)
ax[0][0].plot(average_sales_finland_sectioned, c="red", lw=1)
ax[0][0].plot(average_sales_norway, label="Norway", c="#707070", alpha=0.3, lw=0.5)
ax[0][0].plot(average_sales_sweden, label="Sweden", c="#707070", alpha=0.3, lw=0.5)
ax[0][0].text(datetime(2015, 1, 10), 1500, "Finland", fontsize=12, ha="left", va="center", weight="bold")
ax[0][0].text(datetime(2015, 1, 10), 1400, "Total sales in Finland\nThrough years", fontsize=10, ha="left", va="top")
# ax[0][0].scatter([datetime(2015, 12, 30), datetime(2016, 12, 31), datetime(2017, 12, 30)], average_sales_finland.loc[['2015-12-30', '2016-12-31', '2017-12-30'], :], c="red", marker="3")

ax[1][0].axvspan(pd.Timestamp('2017-10-01'), pd.Timestamp('2018-04-01'), facecolor='#707070', alpha=0.2)
ax[1][0].plot(average_sales_finland, label="Finland", c="#707070", alpha=0.3, lw=0.5)
ax[1][0].plot(average_sales_norway[:"2017-10-01"], label="Norway", c="#2B2D42", lw=1)
ax[1][0].plot(average_sales_norway["2018-04-01":], label="Norway", c="#2B2D42", lw=1)
ax[1][0].plot(average_sales_norway_sectioned, c="red", lw=1)
ax[1][0].plot(average_sales_sweden, label="Sweden", c="#707070", alpha=0.3, lw=0.5)
ax[1][0].text(datetime(2015, 1, 10), 1500, "Norway", fontsize=12, ha="left", va="center", weight="bold")
ax[1][0].text(datetime(2015, 1, 10), 1400, "Total sales in Norway\nThrough years", fontsize=10, ha="left", va="top")

ax[2][0].axvspan(pd.Timestamp('2017-10-01'), pd.Timestamp('2018-04-01'), facecolor='#707070', alpha=0.2)
ax[2][0].plot(average_sales_finland, label="Finland", c="#707070", alpha=0.3, lw=0.5)
ax[2][0].plot(average_sales_norway, label="Norway", c="#707070", alpha=0.3, lw=0.5)
ax[2][0].plot(average_sales_sweden[:"2017-10-01"], label="Sweden", c="#2B2D42", lw=1)
ax[2][0].plot(average_sales_sweden["2018-04-01":], label="Sweden", c="#2B2D42", lw=1)
ax[2][0].plot(average_sales_sweden_sectioned, c="red", lw=1)
ax[2][0].text(datetime(2015, 1, 10), 1500, "Sweden", fontsize=12, ha="left", va="center", weight="bold")
ax[2][0].text(datetime(2015, 1, 10), 1400, "Total sales in Sweden\nThrough years", fontsize=10, ha="left", va="top")

ax[0][1].axvspan(pd.Timestamp('2017-09-25'), pd.Timestamp('2018-04-05'), ymin=-0.5, facecolor='#707070', alpha=0.2)
ax[0][1].plot(average_sales_finland_sectioned, c="red", lw=1)
ax[0][1].text(datetime(2017, 10, 1), 1500, "Semi-annual view", fontsize=12, ha="left", va="center", weight="bold")
ax[0][1].text(datetime(2017, 10, 1), 1400, "Oct 2017 - Mar 2018", fontsize=8, ha="left", va="top")

ax[1][1].axvspan(pd.Timestamp('2017-09-25'), pd.Timestamp('2018-04-05'), ymin=-0.5, facecolor='#707070', alpha=0.2)
ax[1][1].plot(average_sales_norway_sectioned, c="red", lw=1)
ax[1][1].text(datetime(2017, 10, 1), 1500, "Semi-annual view", fontsize=12, ha="left", va="center", weight="bold")
ax[1][1].text(datetime(2017, 10, 1), 1400, "Oct 2017 - Mar 2018", fontsize=8, ha="left", va="top")

ax[2][1].axvspan(pd.Timestamp('2017-09-25'), pd.Timestamp('2018-04-05'), ymin=-0.5, facecolor='#707070', alpha=0.2)
ax[2][1].plot(average_sales_sweden_sectioned, c="red", lw=1);
ax[2][1].text(datetime(2017, 10, 1), 1500, "Semi-annual view", fontsize=12, ha="left", va="center", weight="bold")
ax[2][1].text(datetime(2017, 10, 1), 1400, "Oct 2017 - Mar 2018", fontsize=8, ha="left", va="top");
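
The per-country preparation above repeats the same steps three times. One possible compact form, sketched here on a small made-up frame (the values are invented; only the column names match the dataset), is to group by date and country together and unstack the countries into columns.

```python
import pandas as pd

# Made-up rows: two countries, two dates, two entries per (date, country).
toy = pd.DataFrame({
    "date": ["2017-10-01"] * 4 + ["2017-10-02"] * 4,
    "country": ["Finland", "Finland", "Norway", "Norway"] * 2,
    "num_sold": [100, 200, 300, 500, 110, 210, 310, 510],
})
toy["date"] = pd.to_datetime(toy["date"])

# Mean sales per date, one column per country.
average_by_country = (
    toy.groupby(["date", "country"])["num_sold"].mean().unstack("country")
)

# Date-string slicing on a DatetimeIndex includes both endpoints.
window = average_by_country.loc["2017-10-01":"2017-10-01"]
print(average_by_country)
```

With a frame shaped like this, each country's line is a single column (for example `average_by_country["Norway"]`) and the zoomed window is one slice instead of three.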

### **Conclusion**

We found some important things. Let's list the findings now:
* **Annual seasonality** is present in the data
* The highest total sale is on **29th December 2018**
* Sales **peak around the start of the year**
* **Finland** is the **lowest** in sales
* **Norway** is the **best seller**

These findings could help you choose and use a suitable model for the data. I hope the visualizations did their job and were pleasant to read. Let me know how I can make these visualizations more informative and what topics (questions) I could include.