# Notebook Goal
This notebook is not meant to produce a submission file for the competition. It contains my exploratory data analysis, with help from other developers and their source code, credited and linked where used. If you have any questions about the code below, feel free to send me a DM.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set_theme(style="white", palette="pastel")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
items_df = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
item_categories_df = pd.read_csv(
    '/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
sales_df = pd.read_csv(
'/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv')
shops_df = pd.read_csv(
'/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
test_df = pd.read_csv(
'/kaggle/input/competitive-data-science-predict-future-sales/test.csv')

# Exploratory Data Analysis

In [None]:
sales_df.info()

In [None]:
sales_df.head()

In [None]:
sales_df.describe()

In [None]:
# assign the column directly so it gets a datetime64 dtype (needed later by pd.Grouper and .dt)
sales_df['date'] = pd.to_datetime(sales_df['date'], format="%d.%m.%Y")

In [None]:
print("Timespan of the dataset: from", sales_df['date'].min().date(), "to", sales_df['date'].max().date())
print("date_block_num range: ", sales_df['date_block_num'].min(), "to", sales_df['date_block_num'].max())

In [None]:
print(f"There are {items_df.item_category_id.nunique()} unique item categories.")
print(f"And a total of {items_df.item_name.nunique()} unique items.")

In [None]:
# top 10 categories by number of items
data = items_df.groupby('item_category_id').count().sort_values('item_id', ascending=False)
data = data.iloc[:10].reset_index()

f, ax = plt.subplots(figsize=(10, 6))
ax = sns.barplot(x=data.item_category_id, y=data.item_id)
plt.title("10 most frequent item categories", fontsize=14)
plt.ylabel("# of items")
plt.xlabel("Category")
plt.show()

In [None]:
# best-selling items and their categories
data = sales_df.groupby('item_id')['item_cnt_day'].sum().sort_values(ascending=False)
data = data.iloc[:10].reset_index()
# merge on item_id so that each item is paired with its own category
data = data.merge(items_df[['item_id', 'item_category_id']], on='item_id', how='left')
data = data.rename(columns={'item_category_id': 'item_category'})

f, ax = plt.subplots(figsize=(10, 4))
ax = sns.barplot(x=data.item_id, y=data.item_cnt_day, hue=data.item_category)
plt.title("Bestsellers by item category")
plt.ylabel("Units sold")
plt.xlabel("Item Id")
plt.xticks(rotation=45)
plt.legend(loc='upper left')
plt.show()

In [None]:
items_df[items_df['item_id'] == 20949]

By far, our best-selling item was 20949, which a quick Google translation revealed to be a corporate package of white t-shirts, with an astounding ~180,000 units sold. Let's double-check this.

In [None]:
sales_df[sales_df['item_id'] == 20949]['item_cnt_day'].sum()

In [None]:
print("Quantity of unique shops: ", len(shops_df.shop_id.unique()))

Below, we plot monthly sales per store for the entire period covered by the dataset. Notice how some stores fade out very quickly and others appear partway through. Also notice that no store 'resurrects' after disappearing for some time, which suggests that a disappearance probably marks the closure of that store.

In [None]:
#total volume sold by month per store
group = pd.DataFrame(sales_df.groupby(['shop_id', 'date_block_num'])['item_cnt_day'].sum().reset_index())

f, axs = plt.subplots(nrows=6, ncols=2, figsize=(14, 22), sharex=True, sharey=True)

shop_range = shops_df.shop_id.unique().reshape(12, 5)
count = 0
for i in range(6):
    for j in range(2):
        data = group[group['shop_id'].isin(shop_range[count])]
        sns.lineplot(x='date_block_num', y='item_cnt_day', hue='shop_id', data=data, ax=axs[i,j])
        count+=1
        axs[i,j].legend(bbox_to_anchor=(1, 0.5))
plt.show()
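
The observation that no store reappears after a gap can also be checked numerically. The sketch below is only an illustration: `shop_activity` is a hypothetical helper and the toy frame is made up, but it uses the same column names as the monthly `group` frame built above.

```python
import pandas as pd

def shop_activity(monthly: pd.DataFrame) -> pd.DataFrame:
    """Summarise first/last active month and missing months per shop."""
    # Keep only months in which the shop actually sold something.
    active = monthly[monthly["item_cnt_day"] > 0]
    summary = active.groupby("shop_id")["date_block_num"].agg(
        first_month="min", last_month="max", active_months="nunique"
    )
    # Months between first and last appearance in which the shop has no sales.
    summary["missing_months"] = (
        summary["last_month"] - summary["first_month"] + 1 - summary["active_months"]
    )
    return summary

# Toy data (made up): shop 1 sells every month, shop 2 has a two-month gap.
toy = pd.DataFrame({
    "shop_id":        [1, 1, 1, 2, 2],
    "date_block_num": [0, 1, 2, 0, 3],
    "item_cnt_day":   [5, 7, 6, 4, 9],
})
activity = shop_activity(toy)
print(activity[activity["missing_months"] > 0])
```

Applied to the notebook's data, `shop_activity(group)` would list any store with a hole in its history; an empty result would support the reading of the plots.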

The panel for shops 25 to 29 is particularly easy to read, so we take a closer look at these shops in the plots below.

In [None]:
shops = shop_range[5]
data = group[group['shop_id'].isin(shops)]

f, ax = plt.subplots(figsize=(8,6))
sns.lineplot(data=data, x='date_block_num', y='item_cnt_day', hue='shop_id', ax=ax)
ax.axvline(x=12, color='orange', linestyle='dashed')
ax.axvline(x=24, color='orange', linestyle='dashed')
plt.show()

The dashed orange lines above mark date_block_num 12 and 24 (January 2014 and January 2015, since date_block_num 0 is January 2013). We can spot a trend in sales for all stores shown: from around November, sales grow sharply toward the end of the year. Aside from this, there are some spikes in sales, mostly over the course of 2013 for all stores. I looked up some Russian national holidays and list a few of them below, to see whether they relate to these spikes.

| Date        | Holiday                    | Color   |
|-------------|----------------------------|---------|
| February 23 | Defender of the Fatherland | Gray    |
| March 8     | International Women's Day  | Black   |
| May 1       | Labor Day                  | Red     |
| May 9       | Victory Day                | Red     |
| June 12     | Russia Day                 | Blue    |
| November 4  | National Unity Day         | Orange  |

In [None]:
holidays = {
    '23-02-2013':'Defender of the Fatherland',
    '08-03-2013':"Women's Day",
    '01-05-2013':'Labor Day',
    '09-05-2013':'Victory Day',
    '12-06-2013':'Russia Day',
    '04-11-2013':'National Unity Day',
    '23-02-2014':'Defender of the Fatherland',
    '08-03-2014':"Women's Day",
    '01-05-2014':'Labor Day',
    '09-05-2014':'Victory Day',
    '12-06-2014':'Russia Day',
    '04-11-2014':'National Unity Day',
    '23-02-2015':'Defender of the Fatherland',
    '08-03-2015':"Women's Day",
    '01-05-2015':'Labor Day',
    '09-05-2015':'Victory Day',
    '12-06-2015':'Russia Day'
}

holidays = pd.DataFrame(list(holidays.items()), columns=["date", "name"])
holidays['date'] = pd.to_datetime(holidays['date'], format='%d-%m-%Y')
holidays = holidays.set_index('date')
holidays.head()
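
To place holidays on a `date_block_num` axis, each date has to be converted to a month index. The following is a minimal sketch, assuming that block 0 corresponds to January 2013; `to_date_block_num` is a hypothetical helper, not part of the dataset or of pandas.

```python
import pandas as pd

def to_date_block_num(dates, start_year=2013):
    """Convert dates to a month index, assuming block 0 is January of start_year."""
    dates = pd.DatetimeIndex(dates)
    return (dates.year - start_year) * 12 + (dates.month - 1)

# Three sample holiday dates in the same day-month-year format used above.
holiday_dates = pd.to_datetime(
    ["23-02-2013", "08-03-2014", "12-06-2015"], format="%d-%m-%Y"
)
blocks = to_date_block_num(holiday_dates)
print(blocks.tolist())
```

With the notebook's data, `to_date_block_num(holidays.index)` would give the x positions for vertical lines without hard-coding them.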

In [None]:
sales_df[sales_df.shop_id.isin(shops)].groupby(['shop_id',pd.Grouper(key='date', freq='M')]).agg({'item_cnt_day': 'sum'}).reset_index()

In [None]:
shops = shop_range[5]
data = sales_df[sales_df.shop_id.isin(shops)].groupby(['shop_id',pd.Grouper(key='date', freq='M')]).agg({'item_cnt_day': 'sum'}).reset_index()
data[(data.date.dt.month.isin(holidays.index.month)) & (data.shop_id == 25)]

In [None]:
shops = shop_range[5]
data = group[group['shop_id'].isin(shops)]

f, ax = plt.subplots(figsize=(8,6))

sns.lineplot(data=data, x='date_block_num', y='item_cnt_day', hue='shop_id', ax=ax)
ax.axvline(x=1, color='gray', linestyle='dashed', label='Fatherland', alpha=0.5) # Fatherland
ax.axvline(x=2, color='black', linestyle='dashed', label="Women's Day", alpha=0.5) # Women's Day
ax.axvline(x=4, color='red', linestyle='dashed', label="Labor&Vic. Day", alpha=0.5) # Labor Day & Victory Day
ax.axvline(x=5, color='blue', linestyle='dashed', label="Russia", alpha=0.5) # Russia Day
ax.axvline(x=10, color='orange', linestyle='dashed', label="National Unity", alpha=0.5) # National Unity
ax.axvline(x=13, color='gray', linestyle='dashed', alpha=0.5) # Fatherland
ax.axvline(x=14, color='black', linestyle='dashed', alpha=0.5) # Women's Day
ax.axvline(x=16, color='red', linestyle='dashed', alpha=0.5) # Labor Day & Victory Day
ax.axvline(x=17, color='blue', linestyle='dashed', alpha=0.5) # Russia Day
ax.axvline(x=22, color='orange', linestyle='dashed', alpha=0.5) # National Unity
ax.legend(bbox_to_anchor=(1,1))

plt.suptitle("Sales in 5 stores w/ National Holidays")
plt.show()

From this we can see that some spikes fall in months with holidays. For instance, Women's Day (black) and Russia Day (blue) seem to coincide with growth in sales. Let's mark these two holidays on a daily sales plot. For clarity, we limit the plot to two stores (25 and 26) and take a more general view afterwards.

In [None]:
import matplotlib.dates as mdates

date_formatter = mdates.DateFormatter('%d')

date_block_values = [2, 5]

f, axs = plt.subplots(nrows=2, ncols=1, figsize=(8, 10))
for date in date_block_values:
    ax_id = date_block_values.index(date)
    data = sales_df[(sales_df['date_block_num'] == date) & (sales_df['shop_id'].isin([25,26]))]
    data = pd.DataFrame(data.groupby([pd.Grouper(key='date', freq='D'), 'shop_id'])['item_cnt_day'].sum().reset_index())
    
    if date == 2:
        axs[ax_id].axvline(x=pd.to_datetime('08/03/2013', format='%d/%m/%Y'))
        axs[ax_id].set_title("March 2013")
    elif date == 5:
        axs[ax_id].axvline(x=pd.to_datetime('12/06/2013', format='%d/%m/%Y'))
        axs[ax_id].set_title("June 2013")
    
    sns.lineplot(data=data, x='date', y='item_cnt_day', hue='shop_id', ax=axs[ax_id])
    
    axs[ax_id].xaxis.set_major_formatter(date_formatter)
    axs[ax_id].tick_params('x', labelrotation=90)

plt.subplots_adjust(hspace=0.3)
plt.show()

There doesn't seem to be a clear relation between these holidays and sales growth. The graphs above show plenty of unexplained spikes before and after the holidays.

In [None]:
shops = shop_range[5]
data = sales_df[sales_df['shop_id'].isin(shops)].groupby(['shop_id',pd.Grouper(key='date', freq='M')]).agg({'item_cnt_day': 'sum'}).reset_index()
data = data[data.date.dt.month.isin(holidays.index.month)]

f, ax = plt.subplots(figsize=(8,6))
sns.lineplot(data=data, x='date', y='item_cnt_day', hue='shop_id', ax=ax)
ax.plot(data.date, data.item_cnt_day, 'o')  # markers only; plot_date is deprecated in recent matplotlib
plt.suptitle("Sales in 5 stores w/ National Holidays")
plt.tick_params('x', rotation=90)
plt.show()

In this broader view, we see that while some stores performed better on _some_ holidays, others did not, which leaves something unexplained in our data. For now, this analysis will suffice; the open questions are material for further research.
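
One way to put numbers on the comparison above is to compute month-over-month growth per store and average it separately for holiday months and other months. This is only a sketch on made-up data; `holiday_growth` is a hypothetical helper, and the column names mirror the monthly frame used in the previous cell.

```python
import pandas as pd

def holiday_growth(monthly: pd.DataFrame, holiday_months: set) -> pd.DataFrame:
    """Mean month-over-month growth per shop, split by holiday vs. other months."""
    monthly = monthly.sort_values(["shop_id", "date"]).copy()
    # Relative change with respect to the previous month of the same shop.
    monthly["growth"] = monthly.groupby("shop_id")["item_cnt_day"].pct_change()
    is_holiday = monthly["date"].dt.month.isin(holiday_months)
    monthly["month_type"] = is_holiday.map({True: "holiday", False: "other"})
    return (
        monthly.dropna(subset=["growth"])
        .groupby(["shop_id", "month_type"])["growth"]
        .mean()
        .unstack("month_type")
    )

# Toy monthly totals (made up) for a single shop.
toy = pd.DataFrame({
    "shop_id": [25] * 4,
    "date": pd.to_datetime(["2013-01-31", "2013-02-28", "2013-03-31", "2013-04-30"]),
    "item_cnt_day": [100.0, 120.0, 150.0, 120.0],
})
result = holiday_growth(toy, holiday_months={2, 3, 5, 6, 11})
print(result)
```

A store whose `holiday` column is not clearly above its `other` column would be one of the stores that did not benefit from the holidays.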

Now we want to check whether there is a strong correlation between an item's price and its volume sold in a given month. For this, we build a table with the mean price of each item in each month and the corresponding volume sold, as done in the code below.

In [None]:
# volume sold x mean price, per item and month
group = sales_df.groupby(['date_block_num', 'item_id']).agg({'item_cnt_day': 'sum', 'item_price': 'mean'})
group.corr()
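
`DataFrame.corr()` computes the Pearson coefficient by default, which only captures linear relationships. As a complementary check (not part of the original analysis), a rank-based Spearman coefficient can be computed on the same monthly aggregate. The sketch below uses made-up rows with the notebook's column names.

```python
import pandas as pd

# Toy sales rows (made up for illustration).
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 0, 1, 1],
    "item_id":        [1, 1, 2, 3, 1, 2],
    "item_cnt_day":   [10, 5, 3, 1, 12, 2],
    "item_price":     [100.0, 110.0, 500.0, 2000.0, 100.0, 550.0],
})

# One row per (month, item): total volume and mean price.
monthly = toy.groupby(["date_block_num", "item_id"]).agg(
    volume=("item_cnt_day", "sum"),
    mean_price=("item_price", "mean"),
)

pearson = monthly.corr(method="pearson").loc["volume", "mean_price"]
spearman = monthly.corr(method="spearman").loc["volume", "mean_price"]
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If the two coefficients differ a lot on the real data, the relation between price and volume is probably monotonic but not linear.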

In [None]:
shops_test_set = test_df.shop_id.unique()
products_test_set = test_df.item_id.unique()

In [None]:
print([x for x in shops_test_set if x not in shops_df.shop_id.unique()])
print([x for x in products_test_set if x not in items_df.item_id.unique()])